Bug 821056
| Summary: | file does not remain in sync after self-heal | | |
|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Joe Julian <joe> |
| Component: | replicate | Assignee: | Pranith Kumar K <pkarampu> |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | | |
| Version: | 3.1.7 | CC: | gluster-bugs, jdarcy, rfortier, vbellur |
| Target Milestone: | --- | Keywords: | Triaged |
| Target Release: | --- | | |
| Hardware: | Unspecified | | |
| OS: | Unspecified | | |
| Whiteboard: | | | |
| Fixed In Version: | glusterfs-3.4.0 | Doc Type: | Bug Fix |
| Doc Text: | | Story Points: | --- |
| Clone Of: | | | |
| : | 853684 (view as bug list) | Environment: | |
| Last Closed: | 2013-07-24 17:35:53 UTC | Type: | Bug |
| Regression: | --- | Mount Type: | --- |
| Documentation: | --- | CRM: | |
| Verified Versions: | | Category: | --- |
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
| Cloudforms Team: | --- | Target Upstream Version: | |
| Embargoed: | | | |
| Bug Depends On: | | | |
| Bug Blocks: | 853684, 858501, 895528 | | |
| Attachments: | | | |
**Description** (Joe Julian, 2012-05-11 17:28:06 UTC)
Created attachment 583940 [details]: client state dump
Created attachment 583941 [details]: ewcs2:/var/share/glusterd/d_vmimages state dump
Created attachment 583942 [details]: ewcs4:/var/share/glusterd/d_vmimages state dump
Created attachment 583943 [details]: ewcs7:/var/share/glusterd/d_vmimages state dump
The replaced drive may be a red herring. I didn't start looking for dirty attributes until after the replacement, but I also took ewcs7 down for an upgrade on the 3rd, so it's possible the problem started then.
intranet.img isn't the only file that's showing this behavior. Here are the non-zero xattrs remaining:
ewcs{2,4}:

```
# file: var/spool/glusterfs/a_vmimages/blog.img
trusted.afr.vmimages-client-2=0x000036910000000000000000
# file: var/spool/glusterfs/a_vmimages/mysql1.img
trusted.afr.vmimages-client-2=0x0000ab3b0000000000000000
# file: var/spool/glusterfs/c_vmimages/bazaar.img
trusted.afr.vmimages-client-8=0x00002d720000000000000000
# file: var/spool/glusterfs/d_vmimages/intranet.img
trusted.afr.vmimages-client-11=0x0000273a0000000000000000
# file: var/spool/glusterfs/d_vmimages/squid1.img
trusted.afr.vmimages-client-11=0x00009d510000000000000000
# file: var/spool/glusterfs/d_vmimages/web2.img
trusted.afr.vmimages-client-11=0x00009ff00000000000000000
```

On ewcs7, these xattrs are all zero.
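For context, a `trusted.afr.*` changelog value packs three big-endian 32-bit pending-operation counters: data, metadata, and entry. A small decoder (a sketch, not GlusterFS code) makes the values above readable; all of them have a large *data* count and zeroed metadata/entry counts, i.e. writes are still owed to the named client:

```python
import struct

def decode_afr_changelog(xattr_hex: str) -> dict:
    """Split a trusted.afr.* value into its three big-endian
    32-bit counters: pending (data, metadata, entry) operations."""
    raw = bytes.fromhex(xattr_hex[2:] if xattr_hex.startswith("0x") else xattr_hex)
    data, metadata, entry = struct.unpack(">III", raw)
    return {"data": data, "metadata": metadata, "entry": entry}

# intranet.img's pending counter for vmimages-client-11 from the listing above
print(decode_afr_changelog("0x0000273a0000000000000000"))
# → {'data': 10042, 'metadata': 0, 'entry': 0}
```

So 10042 data operations are recorded as pending against client-11 for intranet.img, and the count never drains, which is consistent with self-heal not completing.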
I also realized that these may be relevant:
```
[2012-05-11 09:32:55.844205] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844807] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.844916] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.845259] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882763] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:55.882928] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.882981] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-0: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:55.883365] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (144): failed to get fd ctx. EBADFD
[2012-05-11 09:32:55.883386] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:55.883402] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.119427] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.119641] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-0: lk op is for a transaction
[2012-05-11 09:32:56.120151] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120334] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.120450] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-2: (156): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.120487] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-0: we're done locking
[2012-05-11 09:32:56.120503] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-0: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.120721] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.121838] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.121894] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.121919] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.252114] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.252441] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.252876] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.252958] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.253320] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (135): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.253343] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.253367] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
[2012-05-11 09:32:56.314070] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.314354] D [afr-lk-common.c:410:transaction_lk_op] 0-vmimages-replicate-3: lk op is for a transaction
[2012-05-11 09:32:56.314827] D [afr-transaction.c:973:afr_post_nonblocking_inodelk_cbk] 0-vmimages-replicate-3: Non blocking inodelks failed. Proceeding to blocking
[2012-05-11 09:32:56.315231] D [client3_1-fops.c:4573:client3_1_finodelk] 0-vmimages-client-11: (159): failed to get fd ctx. EBADFD
[2012-05-11 09:32:56.315253] D [afr-lk-common.c:987:afr_lock_blocking] 0-vmimages-replicate-3: we're done locking
[2012-05-11 09:32:56.315269] D [afr-transaction.c:953:afr_post_blocking_inodelk_cbk] 0-vmimages-replicate-3: Blocking inodelks done. Proceeding to FOP
```
Seems like the re-open of the fd after you brought the brick back up did not work; it keeps returning EBADFD for that fd. That could be why the sync never completes: writes are not happening on the new brick. How did you perform the disk replace? Did you bring the brick down?

I just found that there is a problem with re-opening a file after a disconnect and re-connect in the client xlator. If, at the time of the re-open after the brick reconnects, the file is not present on the brick (since the disk was replaced, I am assuming no files were present on the disk when the brick came back up), the re-open fails, and further operations on the file through that fd happen on only one brick (assuming a 2-way replica). So the file will never be in sync as long as the fd is open. For hosting VMs, this is a problem.

That sounds like that's it. When the drive failed, I killed glusterfsd for that brick, replaced the drive (partitioned, formatted, etc.), then restarted the brick with "gluster volume start vmimages force".

CHANGE: http://review.gluster.org/4357 (protocol/client: Add fdctx back to saved-list after reopen) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4358 (protocol/client: Periodically attempt reopens) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4386 (tests: Added util functions) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4387 (test: test re-opens of open-fds) merged in master by Anand Avati (avati)

CHANGE: http://review.gluster.org/4464 (protocol/client: Avoid double free of frame) merged in master by Anand Avati (avati)

Please backport to release-3.3

CHANGE: http://review.gluster.org/4540 (Tests: Disable open-behind) merged in master by Anand Avati (avati)
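The failure mode described in this bug can be sketched as a toy model. None of this is GlusterFS code; the class names and behavior are illustrative assumptions. The point it demonstrates: once the re-open fails on the replaced brick, every write through the still-open fd lands on the surviving replica only, and the pending changelog counter against the bad brick grows without bound.

```python
class Brick:
    def __init__(self, name):
        self.name = name
        self.files = {}      # path -> bytes "on disk"
        self.online = True

class ReplicatedFd:
    """One open fd across a replica pair, tracking per-brick fd state."""
    def __init__(self, path, bricks):
        self.path = path
        self.bricks = bricks
        self.fd_ok = {b.name: True for b in bricks}   # valid fd context?
        self.pending = {b.name: 0 for b in bricks}    # AFR-style data counter

    def reopen(self, brick):
        # Re-open after reconnect: succeeds only if the file survived.
        # On a freshly replaced disk it is absent, so fd_ok stays False
        # and the client keeps reporting EBADFD for this fd.
        if self.path in brick.files:
            self.fd_ok[brick.name] = True

    def write(self, nbytes):
        for b in self.bricks:
            if b.online and self.fd_ok[b.name]:
                b.files[self.path] = b.files.get(self.path, 0) + nbytes
            else:
                self.pending[b.name] += 1  # write owed to this brick

b1, b2 = Brick("client-2"), Brick("client-11")
fd = ReplicatedFd("intranet.img", [b1, b2])
fd.write(4096)             # healthy: both replicas get the write

b2.online = False          # glusterfsd killed for disk replacement
b2.files.clear()           # new, empty disk
b2.online = True           # "gluster volume start ... force"
fd.fd_ok[b2.name] = False  # old fd context died with the process
fd.reopen(b2)              # fails: file absent on the new disk

fd.write(4096)             # lands on b1 only
fd.write(4096)
print(fd.pending)          # pending count against client-11 keeps growing
print(b1.files, b2.files)  # b2 never receives the data
```

The two merged fixes map onto this model: re-adding the fdctx to the saved list and periodically retrying `reopen()` would eventually flip `fd_ok` back to True once self-heal recreates the file, letting writes reach both bricks again.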