I host raw VM images on:

Volume Name: vmimages
Type: Distributed-Replicate
Status: Started
Number of Bricks: 4 x 3 = 12
Transport-type: tcp
Bricks:
Brick1: ewcs2:/var/spool/glusterfs/a_vmimages
Brick2: ewcs4:/var/spool/glusterfs/a_vmimages
Brick3: ewcs7:/var/spool/glusterfs/a_vmimages
Brick4: ewcs2:/var/spool/glusterfs/b_vmimages
Brick5: ewcs4:/var/spool/glusterfs/b_vmimages
Brick6: ewcs7:/var/spool/glusterfs/b_vmimages
Brick7: ewcs2:/var/spool/glusterfs/c_vmimages
Brick8: ewcs4:/var/spool/glusterfs/c_vmimages
Brick9: ewcs7:/var/spool/glusterfs/c_vmimages
Brick10: ewcs2:/var/spool/glusterfs/d_vmimages
Brick11: ewcs4:/var/spool/glusterfs/d_vmimages
Brick12: ewcs7:/var/spool/glusterfs/d_vmimages

After replacing ewcs7:/var/spool/glusterfs/c_vmimages due to a disk failure, I ran stat() against the files (VM images) in that volume. These VMs were live at the time. The background self-heal was scheduled and completed; the extended attributes, however, still showed pending operations.

Before:
/var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000accd0000000000000000

Log:
[2012-05-11 08:24:28.756410] I [afr-common.c:790:afr_lookup_done] 1-vmimages-replicate-3: background data self-heal triggered. path: /intranet.img
[2012-05-11 08:30:10.230581] I [afr-self-heal-algorithm.c:532:sh_diff_loop_driver_done] 1-vmimages-replicate-3: diff self-heal on /intranet.img: 56 blocks of 49152 were different (0.11%)
[2012-05-11 08:30:10.459175] I [afr-self-heal-common.c:1536:afr_self_heal_completion_cbk] 1-vmimages-replicate-3: background data self-heal completed on /intranet.img

After:
/var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x000000f80000000000000000

Trying again with the client log level set to DEBUG led me to this anomaly:

[2012-05-11 09:35:04.900303] D [afr-self-heal-data.c:573:afr_sh_data_sync_prepare] 1-vmimages-replicate-3: self-healing file /intranet.img from subvolume vmimages-client-9 to 1 other

Why is it self-healing from vmimages-client-9 when it's vmimages-client-11 that's dirty?
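For anyone following along: as I understand the AFR changelog format, each trusted.afr value packs three big-endian 32-bit counters (pending data, metadata, and entry operations, in that order), so a value like the "before" one above can be decoded with plain bash arithmetic. This is just an illustrative decode, not from the report itself:

```shell
# Decode a trusted.afr changelog value into its three counters.
# Assumed layout: 4 bytes pending data ops, 4 bytes pending
# metadata ops, 4 bytes pending entry ops, big-endian.
xattr=0x0000accd0000000000000000
hex=${xattr#0x}                 # strip the 0x prefix
data=$((16#${hex:0:8}))         # first 8 hex digits
meta=$((16#${hex:8:8}))         # next 8 hex digits
entry=$((16#${hex:16:8}))       # last 8 hex digits
echo "data=$data metadata=$meta entry=$entry"
# → data=44237 metadata=0 entry=0
```

So 0x0000accd... means 0xaccd = 44237 data operations pending against that subvolume, with no metadata or entry changes outstanding.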
Created attachment 583940 [details] client state dump
Created attachment 583941 [details] ewcs2:/var/share/glusterd/d_vmimages state dump
Created attachment 583942 [details] ewcs4:/var/share/glusterd/d_vmimages state dump
Created attachment 583943 [details] ewcs7:/var/share/glusterd/d_vmimages state dump
We think it's possible that the changed drive is a red herring. I didn't start looking for dirty attributes until that happened, but I did take ewcs7 down for an upgrade on the 3rd, so it's possible that's when it started. intranet.img isn't the only file showing this behavior. Here are the non-zero xattrs remaining:

ewcs{2,4}:

# file: var/spool/glusterfs/a_vmimages/blog.img
trusted.afr.vmimages-client-2=0x000036910000000000000000

# file: var/spool/glusterfs/a_vmimages/mysql1.img
trusted.afr.vmimages-client-2=0x0000ab3b0000000000000000

# file: var/spool/glusterfs/c_vmimages/bazaar.img
trusted.afr.vmimages-client-8=0x00002d720000000000000000

# file: var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000273a0000000000000000

# file: var/spool/glusterfs/d_vmimages/squid1.img
trusted.afr.vmimages-client-11=0x00009d510000000000000000

# file: var/spool/glusterfs/d_vmimages/web2.img
trusted.afr.vmimages-client-11=0x00009ff00000000000000000

On ewcs7 they are all zero.

I also realized that these may be relevant:

[2012-05-11 09:32:55.844205] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844807] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844916] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.845259] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882763] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882928] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.882981] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.883365] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.883386] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:55.883402] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.119427] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.119641] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:56.120151] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120334] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120450] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.120487] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:56.120503] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.120721] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.121838] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.121894] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.121919] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.252114] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.252441] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.252876] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.252958] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.253320] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.253343] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.253367] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.314070] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.314354] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.314827] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.315231] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.315253] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.315269] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
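For reference, a listing like the one above comes from getfattr; a small filter makes it easy to sweep a brick for files with pending changelogs. The glob and helper name below are mine, not from the report:

```shell
# Filter for `getfattr -d -m trusted.afr -e hex <file>` output that
# keeps only changelog lines with a non-zero value. A hypothetical
# sweep over one server's bricks would look like:
#   getfattr -d -m trusted.afr -e hex \
#       /var/spool/glusterfs/*_vmimages/*.img 2>/dev/null | nonzero_afr
nonzero_afr() {
    grep '^trusted\.afr\.' | grep -v '=0x0\{24\}$'
}

# Example against captured output: the all-zero line is dropped.
printf '%s\n' \
    'trusted.afr.vmimages-client-11=0x00009d510000000000000000' \
    'trusted.afr.vmimages-client-10=0x000000000000000000000000' \
    | nonzero_afr
# → trusted.afr.vmimages-client-11=0x00009d510000000000000000
```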
It seems the re-open of the fd after you brought the brick back up did not succeed; it keeps returning EBADFD for that fd. That could be why the sync never completes: writes are not happening on the new brick. How did you perform the disk replacement? Did you bring the brick down?
I just found that there is a problem in the client xlator with re-opening a file after a disconnect and re-connect. If, at the time of re-open after the brick reconnects, the file is not present on the brick (since the disk was replaced, I am assuming no files were present on the disk when the brick was brought back up), the re-open fails, and further operations on the file through that fd happen on only one brick (assuming a 2-way replica). So the file will never be in sync as long as the fd stays open. For hosting VMs this is a problem.
That sounds like that's it. When the drive failed, I killed glusterfsd for that brick, replaced the drive (partitioned, formatted, etc.), then restarted the brick with "gluster volume start vmimages force".
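For future readers, the sequence described above amounts to roughly the following. This is an ops sketch of the reporter's own steps, not a recommendation; the process pattern, device, and filesystem choice are placeholders:

```
# On the server with the failed disk (names are illustrative):

# 1. Stop the brick process serving the dead disk.
pkill -f 'glusterfsd.*c_vmimages'

# 2. Partition/format the new disk and mount it at the brick path.
mkfs.xfs /dev/sdX1
mount /dev/sdX1 /var/spool/glusterfs/c_vmimages

# 3. Restart the missing brick process.
gluster volume start vmimages force
```

Per the comments above, on the affected version any fds held open across that restart may fail to re-open against the new brick, so the files behind them stay out of sync until those fds are closed and reopened.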
CHANGE: http://review.gluster.org/4357 (protocol/client: Add fdctx back to saved-list after reopen) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4358 (protocol/client: Periodically attempt reopens) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4386 (tests: Added util functions) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4387 (test: test re-opens of open-fds) merged in master by Anand Avati (avati)
CHANGE: http://review.gluster.org/4464 (protocol/client: Avoid double free of frame) merged in master by Anand Avati (avati)
Please backport to release-3.3
CHANGE: http://review.gluster.org/4540 (Tests: Disable open-behind) merged in master by Anand Avati (avati)