Bug 821056

Summary: file does not remain in sync after self-heal
Product: [Community] GlusterFS
Reporter: Joe Julian <joe>
Component: replicate
Assignee: Pranith Kumar K <pkarampu>
Status: CLOSED CURRENTRELEASE
Severity: high
Priority: unspecified
Version: 3.1.7
CC: gluster-bugs, jdarcy, rfortier, vbellur
Keywords: Triaged
Hardware: Unspecified
OS: Unspecified
Fixed In Version: glusterfs-3.4.0
Doc Type: Bug Fix
Clones: 853684 (view as bug list)
Last Closed: 2013-07-24 13:35:53 EDT
Type: Bug
Bug Blocks: 853684, 858501, 895528
Attachments:
- client state dump
- ewcs2:/var/share/glusterd/d_vmimages state dump
- ewcs4:/var/share/glusterd/d_vmimages state dump
- ewcs7:/var/share/glusterd/d_vmimages state dump

Description Joe Julian 2012-05-11 13:28:06 EDT
I host raw VM images on this volume:
Volume Name: vmimages
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: ewcs2:/var/spool/glusterfs/a_vmimages
Brick2: ewcs4:/var/spool/glusterfs/a_vmimages
Brick3: ewcs7:/var/spool/glusterfs/a_vmimages
Brick4: ewcs2:/var/spool/glusterfs/b_vmimages
Brick5: ewcs4:/var/spool/glusterfs/b_vmimages
Brick6: ewcs7:/var/spool/glusterfs/b_vmimages
Brick7: ewcs2:/var/spool/glusterfs/c_vmimages
Brick8: ewcs4:/var/spool/glusterfs/c_vmimages
Brick9: ewcs7:/var/spool/glusterfs/c_vmimages
Brick10: ewcs2:/var/spool/glusterfs/d_vmimages
Brick11: ewcs4:/var/spool/glusterfs/d_vmimages
Brick12: ewcs7:/var/spool/glusterfs/d_vmimages

After replacing ewcs7:/var/spool/glusterfs/c_vmimages due to a disk failure, I ran stat() against the files (VM images) in that volume. These VMs were live at the time. The background self-heal was scheduled and completed, but the extended attributes still showed pending operations.

before:
/var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000accd0000000000000000

log:
[2012-05-11 08:24:28.756410] I [afr-common.c:790:afr_lookup_done] 1-vmimages-replicate-3: background  data self-heal triggered. path: /intranet.img
[2012-05-11 08:30:10.230581] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 1-vmimages-replicate-3: diff self-heal on /intranet.img: 56 blocks of 49152 were different (0.11%)
[2012-05-11 08:30:10.459175] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 1-vmimages-replicate-3: background  data self-heal completed on /intranet.img

after:
/var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x000000f80000000000000000
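For reference, those trusted.afr values pack three big-endian 32-bit counters: data, metadata, and entry pending-operation counts, in that order (my understanding of the AFR changelog format). A small decode sketch:

```python
# Decode a trusted.afr changelog xattr value into its three counters.
# Assumption: 12 bytes = three big-endian uint32 values in the order
# data, metadata, entry (pending operation counts).
import struct

def decode_afr(hex_value: str):
    raw = bytes.fromhex(hex_value.removeprefix("0x"))
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

print(decode_afr("0x0000accd0000000000000000"))  # before: 44237 data ops pending
print(decode_afr("0x000000f80000000000000000"))  # after:  248 data ops pending
```

So the self-heal reduced the pending data count from 44237 to 248, but never drove it to zero.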

Trying again with client-log level debug led me to this anomaly:
[2012-05-11 09:35:04.900303] D [afr-self-heal-data.c:573:afr_sh_data_sync_prepare] 1-vmimages-replicate-3: self-healing file /intranet.img from subvolume vmimages-client-9 to 1 other

Why is it self-healing from vmimages-client-9 when it's vmimages-client-11 that's dirty?
Comment 1 Joe Julian 2012-05-11 18:10:00 EDT
Created attachment 583940 [details]
client state dump
Comment 2 Joe Julian 2012-05-11 18:10:54 EDT
Created attachment 583941 [details]
ewcs2:/var/share/glusterd/d_vmimages state dump
Comment 3 Joe Julian 2012-05-11 18:11:31 EDT
Created attachment 583942 [details]
ewcs4:/var/share/glusterd/d_vmimages state dump
Comment 4 Joe Julian 2012-05-11 18:11:57 EDT
Created attachment 583943 [details]
ewcs7:/var/share/glusterd/d_vmimages state dump
Comment 5 Joe Julian 2012-05-11 18:20:06 EDT
The replaced drive may be a red herring. I didn't start looking for dirty attributes until after the replacement, but I also took ewcs7 down for an upgrade on the 3rd, so it's possible the problem started then.

intranet.img isn't the only file that's showing this behavior. Here are the non-zero xattrs remaining:

ewcs{2,4}:
# file: var/spool/glusterfs/a_vmimages/blog.img
trusted.afr.vmimages-client-2=0x000036910000000000000000

# file: var/spool/glusterfs/a_vmimages/mysql1.img
trusted.afr.vmimages-client-2=0x0000ab3b0000000000000000

# file: var/spool/glusterfs/c_vmimages/bazaar.img
trusted.afr.vmimages-client-8=0x00002d720000000000000000

# file: var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000273a0000000000000000

# file: var/spool/glusterfs/d_vmimages/squid1.img
trusted.afr.vmimages-client-11=0x00009d510000000000000000

# file: var/spool/glusterfs/d_vmimages/web2.img
trusted.afr.vmimages-client-11=0x00009ff00000000000000000

ewcs7 is all 0
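To gather a list like the one above, something along these lines can walk a brick and report any non-zero trusted.afr values (a hypothetical helper, not a GlusterFS tool; must run as root on the brick host, since the trusted.* xattr namespace is not readable by ordinary users):

```python
# Sketch: walk a brick directory and report files whose trusted.afr
# xattrs contain any non-zero byte, i.e. files AFR still considers
# dirty. Linux-only (os.listxattr/os.getxattr); the path below is an
# example from this bug.
import os

def nonzero_afr_xattrs(brick_root: str):
    dirty = {}
    for dirpath, _, filenames in os.walk(brick_root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                attrs = os.listxattr(path, follow_symlinks=False)
            except OSError:
                continue  # filesystem without xattr support, or race
            for attr in attrs:
                if not attr.startswith("trusted.afr."):
                    continue
                value = os.getxattr(path, attr, follow_symlinks=False)
                if any(value):  # any non-zero byte -> pending ops
                    dirty.setdefault(path, {})[attr] = value.hex()
    return dirty

# e.g. nonzero_afr_xattrs("/var/spool/glusterfs/d_vmimages")
```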

I also realized that these may be relevant:
[2012-05-11 09:32:55.844205] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844807] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844916] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.845259] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882763] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882928] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.882981] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.883365] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.883386] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:55.883402] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.119427] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.119641] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:56.120151] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120334] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120450] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.120487] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:56.120503] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.120721] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.121838] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.121894] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.121919] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.252114] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.252441] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.252876] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.252958] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.253320] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.253343] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.253367] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.314070] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.314354] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.314827] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.315231] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.315253] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.315269] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
Comment 6 Pranith Kumar K 2012-06-13 04:56:36 EDT
It seems the re-open of the fd after you brought the brick back up did not succeed; it keeps returning EBADFD for that fd. That could be why the sync never completes: writes are not happening on the new brick. How did you perform the disk replacement? Did you bring the brick down?
Comment 7 Pranith Kumar K 2012-06-13 05:29:17 EDT
I just found that there is a problem in the client xlator with re-opening a file after a disconnect and re-connect. If, at the time of the re-open after the brick reconnects, the file is not present on the brick (since the disk was replaced, I am assuming no files were present on the disk when the brick was brought back up), the re-open fails, and further operations on the file through that fd happen on only one brick (assuming 2 replicas). So the file will never be in sync as long as the fd is open. For hosting VMs this is a problem.
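The mechanism described in comment 7 can be sketched as a toy model (plain Python, not GlusterFS code; all names are made up). A one-shot reopen against the freshly replaced brick fails because the file isn't there yet, so every write through the still-open fd lands on the surviving replica only; a later retry, after self-heal has recreated the file, lets writes reach both bricks again:

```python
# Toy model of the fd-reopen failure mode. Not GlusterFS code.
class Brick:
    def __init__(self, files=None):
        self.files = dict(files or {})  # path -> contents on this brick
        self.fds = set()                # paths with a usable server-side fd

def reopen(brick, path):
    """Re-open after reconnect: fails if the file is absent (fresh disk)."""
    if path in brick.files:
        brick.fds.add(path)
        return True
    return False

def write(bricks, path, data):
    """A write reaches only the bricks holding a usable fd for the path."""
    count = 0
    for b in bricks:
        if path in b.fds:
            b.files[path] = data
            count += 1
    return count

# Replica pair: brick A healthy, brick B freshly replaced (empty disk).
a = Brick({"/intranet.img": "old"})
a.fds.add("/intranet.img")
b = Brick()

reopen(b, "/intranet.img")                         # one-shot reopen fails
assert write([a, b], "/intranet.img", "v2") == 1   # only brick A is written

# After self-heal recreates the file, a periodic retry (the behavior
# review 4358 adds) makes the reopen succeed and writes reach both.
b.files["/intranet.img"] = "healed"
reopen(b, "/intranet.img")
assert write([a, b], "/intranet.img", "v3") == 2
```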
Comment 8 Joe Julian 2012-06-13 10:09:11 EDT
That sounds like it. When the drive failed, I killed glusterfsd for that brick, replaced the drive (partitioned, formatted, etc.), then restarted the brick with "gluster volume start vmimages force".
Comment 9 Vijay Bellur 2013-02-03 14:34:14 EST
CHANGE: http://review.gluster.org/4357 (protocol/client: Add fdctx back to saved-list after reopen) merged in master by Anand Avati (avati@redhat.com)
Comment 10 Vijay Bellur 2013-02-03 14:34:31 EST
CHANGE: http://review.gluster.org/4358 (protocol/client: Periodically attempt reopens) merged in master by Anand Avati (avati@redhat.com)
Comment 11 Vijay Bellur 2013-02-03 14:35:09 EST
CHANGE: http://review.gluster.org/4386 (tests: Added util functions) merged in master by Anand Avati (avati@redhat.com)
Comment 12 Vijay Bellur 2013-02-03 14:36:03 EST
CHANGE: http://review.gluster.org/4387 (test: test re-opens of open-fds) merged in master by Anand Avati (avati@redhat.com)
Comment 13 Vijay Bellur 2013-02-05 01:12:05 EST
CHANGE: http://review.gluster.org/4464 (protocol/client: Avoid double free of frame) merged in master by Anand Avati (avati@redhat.com)
Comment 14 Joe Julian 2013-02-18 12:30:46 EST
Please backport to release-3.3
Comment 15 Vijay Bellur 2013-02-19 19:09:22 EST
CHANGE: http://review.gluster.org/4540 (Tests: Disable open-behind) merged in master by Anand Avati (avati@redhat.com)