Bug 1449329
| Summary: | When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value. | |||
|---|---|---|---|---|
| Product: | [Community] GlusterFS | Reporter: | Ravishankar N <ravishankar> | |
| Component: | replicate | Assignee: | Ravishankar N <ravishankar> | |
| Status: | CLOSED CURRENTRELEASE | QA Contact: | ||
| Severity: | medium | Docs Contact: | ||
| Priority: | unspecified | |||
| Version: | mainline | CC: | bugs, miklos.fokin, pkarampu, ravishankar, rtalur | |
| Target Milestone: | --- | Keywords: | Triaged | |
| Target Release: | --- | |||
| Hardware: | x86_64 | |||
| OS: | Linux | |||
| Whiteboard: | ||||
| Fixed In Version: | glusterfs-3.12.0 | Doc Type: | If docs needed, set a value | |
| Doc Text: | Story Points: | --- | ||
| Clone Of: | 1444892 | |||
| : | 1449924 1449941 (view as bug list) | Environment: | ||
| Last Closed: | 2017-09-05 17:29:38 UTC | Type: | Bug | |
| Regression: | --- | Mount Type: | --- | |
| Documentation: | --- | CRM: | ||
| Verified Versions: | Category: | --- | ||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | ||
| Cloudforms Team: | --- | Target Upstream Version: | ||
| Embargoed: | ||||
| Bug Depends On: | ||||
| Bug Blocks: | 1444892, 1449924, 1449941 | |||
Description
Ravishankar N
2017-05-09 16:08:54 UTC
REVIEW: https://review.gluster.org/17227 (afr: send the correct iatt values in fsync cbk) posted (#1) for review on master by Ravishankar N (ravishankar)

Miklos, you might want to give the patch a test. Or you could wait until it gets reviewed first. I wasn't able to hit the issue any more on 3.10 with the patch.

COMMIT: https://review.gluster.org/17227 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 1a8fa910ccba7aa941f673302c1ddbd7bd818e39
Author: Ravishankar N <ravishankar>
Date:   Tue May 9 02:31:39 2017 +0530

    afr: send the correct iatt values in fsync cbk

    Problem:
    afr unwinds the fsync fop with an iatt buffer from one of its children
    on whom fsync was successful. But that child might not be a valid read
    subvolume for that inode because of pending heals or because it happens
    to be the arbiter brick etc. Thus we end up sending the wrong iatt to
    mdcache which will in turn serve it to the application on a subsequent
    stat call as reported in the BZ.

    Fix:
    Pick a child on whom the fsync was successful *and* that is readable as
    indicated in the inode context.

    Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf
    BUG: 1449329
    RCA'ed-by: Miklós Fokin <miklos.fokin>
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: https://review.gluster.org/17227
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>

(In reply to Ravishankar N from comment #2)
> Miklos, you might want to give the patch a test. Or you could wait until it
> gets reviewed first. I wasn't able to hit the issue any more on 3.10 with
> the patch.

Hello, sorry for the late response.
I tried the test with the patch; the results seem to be the same for me as with the original arbiter check that I implemented.
The faulty behavior when killing the brick stopped, so only the problem when restarting the brick remains, where stat returns a bit less than it should.
The reproduction time varied widely, though: I tried it three times, and twice the reproduction took around 5-10 minutes, while the last run took around 30.

(In reply to miklos.fokin from comment #4)
> (In reply to Ravishankar N from comment #2)
> > Miklos, you might want to give the patch a test. Or you could wait until it
> > gets reviewed first. I wasn't able to hit the issue any more on 3.10 with
> > the patch.
>
> Hello, sorry for the late response.
> I tried the test with the patch; the results seem to be the same for me as
> with the original arbiter check that I implemented.
> The faulty behavior when killing the brick stopped, so only the problem when
> restarting the brick remains, where stat returns a bit less than it should.
> The reproduction time varied widely, though: I tried it three times, and twice
> the reproduction took around 5-10 minutes, while the last run took around 30.

Hi Miklos, the fix was supposed to solve even the lesser stat size problem. I wasn't able to hit it with the fix on a 30-minute run when I tested the patch, but if you are able to hit it 3/3 times, it looks like there is something more to be fixed.

So I ran it for about 90 minutes plus, but could not hit the issue.
Can you add the following log to see if afr_fsync_cbk is indeed the one that is sending a short inode size when you run your test?
--------------------------
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 34d9e56..cd03056 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -3371,6 +3371,7 @@ afr_fsync_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 if (i == read_subvol)
                         break;
         }
+        gf_log (this->name, GF_LOG_INFO, "Unwinding with a size of %ld bytes from brick:%d", local->cont.inode_wfop.postbuf.ia_size, read_subvol);
 
         /* Make a stub out of the frame, and register it
            with the waking up post-op. When the call-stub resumes,
--------------------------
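For reference, the child-selection idea described in the commit message above amounts to roughly the following. This is a self-contained, simplified sketch: the types and names (child_reply_t, pick_fsync_reply, the readable array) are illustrative stand-ins, not the actual afr structures or the committed code.
--------------------------
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
        bool     valid;    /* did this child answer the fsync at all?  */
        int      op_ret;   /* 0 on success, -1 on failure              */
        uint64_t ia_size;  /* file size reported in the postbuf iatt   */
} child_reply_t;

/* Return the index of the child whose postbuf should be unwound to the
 * upper xlators (and eventually cached by md-cache), or -1 if no child
 * succeeded.  'readable' marks children that are valid read subvolumes
 * for the inode (no pending heal, not the arbiter). */
static int
pick_fsync_reply (const child_reply_t *replies, const bool *readable,
                  int child_count)
{
        int fallback = -1;

        for (int i = 0; i < child_count; i++) {
                if (!replies[i].valid || replies[i].op_ret != 0)
                        continue;
                if (readable[i])
                        return i;     /* successful *and* readable: use it */
                if (fallback == -1)
                        fallback = i; /* successful but not readable       */
        }
        return fallback;              /* old behaviour, last resort only   */
}

int
main (void)
{
        /* Three-way replica where child 2 is the arbiter: it answers the
         * fsync successfully but reports a zero size, so it must not win. */
        child_reply_t replies[3] = {
                { .valid = true, .op_ret = -1, .ia_size = 0        },
                { .valid = true, .op_ret = 0,  .ia_size = 49600000 },
                { .valid = true, .op_ret = 0,  .ia_size = 0        },
        };
        bool readable[3] = { false, true, false };

        printf ("unwinding reply of child %d\n",
                pick_fsync_reply (replies, readable, 3));   /* prints 1 */
        return 0;
}
--------------------------
The point of the sketch is the order of preference: a child that both answered the fsync successfully and is marked readable in the inode context wins over one that merely answered successfully (for example the arbiter brick, which holds no file data).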
Hello, sorry for the late response this time as well. I put the logging in place and tried it out. I am pasting the client logs together with the reproduction output.

[2017-05-12 08:47:52.845882] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 44400000 bytes from brick:1
[2017-05-12 08:47:52.859213] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 44800000 bytes from brick:1
[2017-05-12 08:47:52.876972] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 45200000 bytes from brick:1
[2017-05-12 08:47:52.893153] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 45600000 bytes from brick:1
[2017-05-12 08:47:52.907866] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46000000 bytes from brick:1
[2017-05-12 08:47:52.924883] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46400000 bytes from brick:1
[2017-05-12 08:47:52.940711] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46800000 bytes from brick:1
[2017-05-12 08:47:52.957370] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 47200000 bytes from brick:1
[2017-05-12 08:47:52.971701] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 47600000 bytes from brick:1
[2017-05-12 08:47:52.993772] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48000000 bytes from brick:1
[2017-05-12 08:47:53.007742] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48400000 bytes from brick:1
[2017-05-12 08:47:53.025668] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48800000 bytes from brick:1
[2017-05-12 08:47:53.038570] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 49200000 bytes from brick:1
[2017-05-12 08:47:53.080572] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 49600000 bytes from brick:1

fstat fields:
in_fstat.st_dev: 47
in_fstat.st_ino: 12217325060665640715
in_fstat.st_mode: 33261
in_fstat.st_nlink: 1
in_fstat.st_uid: 0
in_fstat.st_gid: 0
in_fstat.st_rdev: 0
in_fstat.st_size: 44800000
in_fstat.st_blksize: 131072
in_fstat.st_blocks: 23808
in_fstat.st_atime: 1494578873
in_fstat.st_mtime: 1494578873
in_fstat.st_ctime: 1494578873

2017-05-12 10:47:53.083996: fstat returned a size of 44800000, bytes_written: 49600000

From this I would guess that the problem lies elsewhere. A day or two ago I did check my own logs, the ones added by the diff I put up here a while ago, and it seemed like this issue is caused when an invalidate call arrives at md-cache. I haven't yet had time to look at it in more depth, as I'm trying to test other things in the meantime (it might have its roots in afr as well; it is just that I only had logs from md-cache for invalidate).
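For context, the reproduction output above has the shape of a simple append/fsync/fstat check. Below is a minimal sketch of such a check with plain POSIX calls; the mount path and the 400000-byte chunk size are assumptions inferred from the log, not the actual test program. The idea is to run it against a FUSE mount of the replica volume while killing or restarting one brick.
--------------------------
/* Minimal sketch: append, fsync, then verify that fstat() reports at
 * least what has been written so far.  Path, chunk size and the loop
 * shape are made up for illustration. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int
main (void)
{
        const char *path = "/mnt/glusterfs/testfile";   /* assumed mount */
        static char chunk[400000];                       /* assumed size  */
        off_t bytes_written = 0;
        struct stat st;
        int fd;

        memset (chunk, 'a', sizeof (chunk));

        fd = open (path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) {
                perror ("open");
                return 1;
        }

        for (;;) {
                if (write (fd, chunk, sizeof (chunk)) !=
                    (ssize_t) sizeof (chunk)) {
                        perror ("write");
                        break;
                }
                bytes_written += sizeof (chunk);

                if (fsync (fd) != 0) {
                        perror ("fsync");
                        break;
                }
                if (fstat (fd, &st) != 0) {
                        perror ("fstat");
                        break;
                }
                /* The bug shows up as st_size falling behind what was
                 * already written and fsync'ed. */
                if (st.st_size < bytes_written) {
                        printf ("fstat returned a size of %lld, "
                                "bytes_written: %lld\n",
                                (long long) st.st_size,
                                (long long) bytes_written);
                        break;
                }
        }
        close (fd);
        return 0;
}
--------------------------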
I'm pasting a part of my own logs from another reproduction, just in case it might help figure out what goes wrong.

2017-05-12 12:23:58.122391: mdc_invalidate
2017-05-12 12:23:58.122450: iatt_set_validate
2017-05-12 12:23:58.122464: mdc fields: md_nlink: 1 md_uid: 0 md_gid: 0 md_rdev: 0 md_size: 28800000 md_blocks: 56250 md_atime: 1494584636 md_mtime: 1494584638 md_ctime 1494584638
2017-05-12 12:23:58.122473: prebuf:
2017-05-12 12:23:58.122480: mdc_from_iatt
2017-05-12 12:23:58.122491: iatt fields: ia_dev: 2050 ia_ino: 11843813978708742420 ia_nlink: 1 ia_uid: 0 ia_gid: 0 ia_rdev: 0 ia_size: 0 ia_blksize: 4096 ia_blocks: 0 ia_atime: 1494584638 ia_mtime: 1494584638 ia_ctime 1494584638
2017-05-12 12:23:58.122510: mdc_fstat
2017-05-12 12:23:58.122522: after_local_check
2017-05-12 12:23:58.122544: mdc_to_iatt
2017-05-12 12:23:58.122556: mdc fields: md_nlink: 1 md_uid: 0 md_gid: 0 md_rdev: 0 md_size: 0 md_blocks: 0 md_atime: 1494584638 md_mtime: 1494584638 md_ctime 1494584638

fstat fields:
in_fstat.st_dev: 47
in_fstat.st_ino: 11843813978708742420
in_fstat.st_mode: 33261
in_fstat.st_nlink: 1
in_fstat.st_uid: 0
in_fstat.st_gid: 0
in_fstat.st_rdev: 0
in_fstat.st_size: 0
in_fstat.st_blksize: 131072
in_fstat.st_blocks: 0
in_fstat.st_atime: 1494584638
in_fstat.st_mtime: 1494584638
in_fstat.st_ctime: 1494584638

2017-05-12 12:23:58.122649: fstat returned a size of 0, bytes_written: 28800000
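For what it's worth, a tiny self-contained sketch of the failure pattern visible in this trace: the cache holds a size of 28800000, an invalidation arrives whose iatt carries ia_size 0, and the next fstat is then served a size of 0 from the cache. The struct and function names here are hypothetical and not the md-cache API; the sketch only mirrors the observed values.
--------------------------
#include <stdint.h>
#include <stdio.h>

typedef struct {
        uint64_t ia_size;    /* size carried by the invalidation's iatt */
} iatt_t;

typedef struct {
        uint64_t md_size;    /* size currently cached for the inode */
} md_cache_entry_t;

/* Taking the size from the invalidation unconditionally reproduces the
 * bad behaviour when that iatt comes from a non-readable brick. */
static void
cache_invalidate (md_cache_entry_t *entry, const iatt_t *postbuf)
{
        entry->md_size = postbuf->ia_size;
}

int
main (void)
{
        md_cache_entry_t entry = { .md_size = 28800000 };  /* from the log */
        iatt_t stale = { .ia_size = 0 };                   /* from the log */

        cache_invalidate (&entry, &stale);
        printf ("fstat served from cache: %llu bytes\n",
                (unsigned long long) entry.md_size);       /* prints 0 */
        return 0;
}
--------------------------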
Hello,
I've been adding some logging and trying to track down the problem.
It seems like the invalidate calls are also coming from the arbiter.
The following calls (writev, lookup, setattr, stat) sometimes trigger an invalidate notification to be sent from the upcall translator (a call to upcall_client_cache_invalidate).
I tried adding a check to afr_notify, before it calls default_notify, to filter out messages sent from an arbiter, but it did not seem to work (reproduction was still possible afterwards). I might have done it wrong, though:
        if (propagate && !(event == GF_EVENT_UPCALL && AFR_IS_ARBITER_BRICK(priv, idx)))
                ret = default_notify (this, event, data);
Even so, I'm fairly sure that it is the arbiter causing the problems, because I print the file that the operation is being executed on in the posix_fdstat, posix_istat and posix_pstat functions in posix-helpers.c, right before the invalidation message gets sent.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/