+++ This bug was initially created as a clone of Bug #1444892 +++

Description of problem:
I have an application that counts how many bytes have been written to a file (the value returned by pwrite is added to a running sum on each iteration). After writing the file, the application calls fdatasync and then fstat. I kill one of the brick processes (brick1) and restart it after a few seconds, then give it enough time to heal (checking with "gluster volume heal info" and grepping for a non-zero number of entries). fstat sometimes returns a bad st_size, even though it does not report any error. If this happens after a restart, it simply returns a size somewhat smaller than it should be (around 100 megabytes). If it happens after killing the brick process, the size is 0 (this happened once with a 900 megabyte file, but otherwise only with files shorter than 100 megabytes).

Version-Release number of selected component (if applicable):
3.10.0

How reproducible:
It happens rarely (once in about 150-300 tries), but with the attached application and script it takes 5-15 minutes to reproduce on my own computer.

Steps to Reproduce:
1. Create a volume with replica 3, of which 1 brick is an arbiter.
2. Run the application (a.out from main.cpp in my scripts; can be automated with compile_run.sh) to create and write a file many times on the volume.
3. Run the script (test_fstat.sh) to continuously kill and restart the first brick.
4. After a while the application should detect a bad fstat st_size and exit, printing the st_size value; the script reports whether the application exited after the brick was killed or after it was restarted.

Actual results:
Sometimes fstat returns an st_size that is either 0 or less than the actual file size.

Expected results:
fstat should always return the correct file size, or produce an error.
Additional info:
With performance.stat-prefetch turned off I ran the application and the script for about 45 minutes without reproducing the bug; with the option turned on it usually took 5-15 minutes to get an error. The results also seem to depend on the number of iterations/writes in the application (and thus probably on the size of the file being written). When the number in the loop's condition (the number of 400KB writes to the file) is 150, the problem happens both after killing and after restarting the brick. When the number was higher (200000), I only got bad results after restarts, except once out of around 10 tries.

- FUSE mount configuration: -o direct-io-mode=on passed explicitly to mount
- volume configuration:
nfs.disable: on
transport.address-family: inet
cluster.consistent-metadata: yes
cluster.eager-lock: on
cluster.readdir-optimize: on
cluster.self-heal-readdir-size: 64KB
cluster.self-heal-daemon: on
cluster.read-hash-mode: 2
cluster.use-compound-fops: on
cluster.ensure-durability: on
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.quick-read: off
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.flush-behind: off
performance.write-behind: off
performance.open-behind: off
cluster.background-self-heal-count: 1
network.inode-lru-limit: 1024
network.ping-timeout: 1
performance.io-cache: off
cluster.locking-scheme: granular
cluster.granular-entry-heal: enable

--- Additional comment from on 2017-04-26 07:04:34 EDT ---

Yesterday I got a suggestion about a review that might fix the problem, with the changes being in performance.readdir-ahead (https://review.gluster.org/#/c/16419/), but I managed to reproduce the issue with that option off (and stat-prefetch on) as well.
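For reference, options like the ones listed above are applied with `gluster volume set`; a sketch (the volume name "volume1" is taken from the client logs later in this report, and the server hostname and mount point are placeholders):

```shell
# Apply the reproduction-relevant options (each option in the list above
# is set the same way; only a few are shown here).
gluster volume set volume1 performance.stat-prefetch off
gluster volume set volume1 performance.write-behind off
gluster volume set volume1 cluster.consistent-metadata yes
gluster volume set volume1 network.ping-timeout 1

# FUSE mount with direct I/O, as used in the reproduction:
mount -t glusterfs -o direct-io-mode=on server1:/volume1 /mnt/volume1
```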
--- Additional comment from on 2017-05-04 09:02:52 EDT ---

Today I found a part of the code that seems to be the problem when the faulty 0 size from fstat shows up after a kill. It only has an indirect connection to stat-prefetch, but the issue seems to be what I describe in the following paragraph: with the solution I suggest, I didn't get a reproduction in 30-35 minutes, after which I got the other problem, where the size is just a bit lower than it should be.

In afr_fsync_cbk, when receiving data from the bricks, there are two points at which the fstat data can get updated: first the initial update, and then the one from the brick we selected to get the final data from. During debugging I found that if the initial update comes from the arbiter, the size will be 0. If the brick we selected to get the final data from is down, we get a struct filled with zeroes and an error value, so we never get a second update. The change with which I didn't get a 0 size was simply waiting for a non-arbiter brick to provide the first update.

--- Additional comment from Pranith Kumar K on 2017-05-05 01:21:59 EDT ---

As per the mail and patch on the gluster-devel mailing list, this seems to be a replicate issue. Changing the component and assignee.

--- Additional comment from Ravishankar N on 2017-05-05 03:29:24 EDT ---

(In reply to miklos.fokin from comment #2)
> In afr_fsync_cbk, when receiving data from the bricks there are two times
> when the fstat data can get updated: first is the initial update, and then
> the one from the brick we selected to get the final data from.
> During the debugging I found that if the initial update is coming from the
> arbiter the size will be 0.
> If the brick we selected to get the final data from is down, we get a struct
> filled with zeroes, and an error value, thus we don't get a second update.

Hi Miklos, did you observe while debugging that in afr_fsync_cbk(), the 'read_subvol' is indeed the brick you brought down?
I haven't tried your test yet, but if a brick is brought down, then the read_subvol is updated on the next lookup/inode_refresh, where the brick is no longer marked readable until heal is complete. So if you brought down brick1, then the call to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

--- Additional comment from on 2017-05-05 04:11 EDT ---

--- Additional comment from on 2017-05-05 04:13:45 EDT ---

(In reply to Ravishankar N from comment #4)
> Hi Miklos, did you observe while debugging that in afr_fsync_cbk(), the
> 'read_subvol' is indeed the brick you brought down? I haven't tried your
> test yet but if a brick is brought down, then the read_subvol is updated on
> the next lookup/inode_refresh where the brick is not marked readable any
> more until heal is complete. So, if you brought down brick1, then the call
> to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

Hello Ravishankar,
what I did was put some logging into afr_fsync_cbk() to see which code paths get taken (after doing the same with other functions and finally getting there while debugging). I also added logging to print the postbuf in each call to the function. The first log was in the "if (op_ret == 0) {" branch, another in "if (local->op_ret == -1)", and another in "if (child_index == read_subvol)".
I am also attaching the complete diff with the logs and the fix to show you the code; perhaps it is easier to get it from there.

When things went wrong I received 3 calls:
1 with a mostly correct structure, apart from the number of blocks and the size of the file, which were zero; here I got a log from the first and second places
1 with a completely correct structure, but here I only got a log from the first place
1 with a completely zeroed structure and no logs from any of the branches

I am writing this from memory, since I moved on and didn't save the logs, but I can reproduce it and send you the file in case you need it. These findings led me to believe that since the third branch never got taken, it was the third reply that was selected, but since it returned an error it never got processed.

--- Additional comment from on 2017-05-09 03:44:48 EDT ---

In my last comment I forgot to mention that for debugging I had to set the mount's attribute-timeout to 0 so that I could get the logs I wanted. I thought I would mention it, as it might be helpful to anyone wanting to do the same.

--- Additional comment from Ravishankar N on 2017-05-09 04:09:28 EDT ---

Hi Miklos, I was able to re-create the issue with the test case you gave; I got a zero st_size within a couple of minutes of running it. I will be sending the patch out soon. The fix is indeed in afr_fsync_cbk, with some modifications that should handle even the non-zero but incorrect st_size errors that you observed.

--- Additional comment from on 2017-05-09 04:23:25 EDT ---

Hey, great news, thanks!
REVIEW: https://review.gluster.org/17227 (afr: send the correct iatt values in fsync cbk) posted (#1) for review on master by Ravishankar N (ravishankar)
Miklos, you might want to give the patch a test. Or you could wait until it gets reviewed first. I wasn't able to hit the issue any more on 3.10 with the patch.
COMMIT: https://review.gluster.org/17227 committed in master by Pranith Kumar Karampuri (pkarampu)
------
commit 1a8fa910ccba7aa941f673302c1ddbd7bd818e39
Author: Ravishankar N <ravishankar>
Date: Tue May 9 02:31:39 2017 +0530

    afr: send the correct iatt values in fsync cbk

    Problem:
    afr unwinds the fsync fop with an iatt buffer from one of its children
    on whom fsync was successful. But that child might not be a valid read
    subvolume for that inode because of pending heals or because it happens
    to be the arbiter brick etc. Thus we end up sending the wrong iatt to
    mdcache which will in turn serve it to the application on a subsequent
    stat call as reported in the BZ.

    Fix:
    Pick a child on whom the fsync was successful *and* that is readable as
    indicated in the inode context.

    Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf
    BUG: 1449329
    RCA'ed-by: Miklós Fokin <miklos.fokin>
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: https://review.gluster.org/17227
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
(In reply to Ravishankar N from comment #2)
> Miklos, you might want to give the patch a test. Or you could wait until it
> gets reviewed first. I wasn't able to hit the issue any more on 3.10 with
> the patch.

Hello, sorry for the late response.
I tried the test with the patch; the results seem to be the same for me as with the original arbiter check that I implemented. The faulty behavior when killing the brick stopped, so only the problem when restarting the brick remains, where it returns a bit less than it should. The reproduction time varied widely, though: I tried it 3 times, and twice the reproduction took around 5-10 minutes, while the last time it took around 30.
(In reply to miklos.fokin from comment #4)
> I tried the test with the patch, the results seem to be the same for me as
> with the original arbiter check that I implemented.
> The faulty behavior when killing the brick stopped, so only the problem when
> restarting the brick remains, when it returns a bit less than what it should.
> The reproduction time varied widely though, I tried it 3 times and 2 times
> the reproduction took around 5-10 minutes, while the last one was around 30.

Hi Miklos, the fix was supposed to solve even the smaller-than-expected stat size problem. I wasn't able to hit it with the fix on a 30-minute run when I tested the patch, but if you are able to hit it 3/3 times, it looks like there is something more to be fixed.
So I ran it for over 90 minutes but could not hit the issue. Can you add the following log to see if afr_fsync_cbk is indeed the one that is sending a short inode size when you run your test?
--------------------------
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 34d9e56..cd03056 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -3371,6 +3371,7 @@ afr_fsync_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                 if (i == read_subvol)
                         break;
         }
+        gf_log (this->name, GF_LOG_INFO, "Unwinding with a size of %ld bytes from brick:%d", local->cont.inode_wfop.postbuf.ia_size, read_subvol);

         /* Make a stub out of the frame, and register it with the
          * waking up post-op. When the call-stub resumes,
--------------------------
Hello, sorry for the late response this time as well. I put the logging in place and tried it out. I am pasting the client logs together with the reproduction output.

[2017-05-12 08:47:52.845882] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 44400000 bytes from brick:1
[2017-05-12 08:47:52.859213] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 44800000 bytes from brick:1
[2017-05-12 08:47:52.876972] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 45200000 bytes from brick:1
[2017-05-12 08:47:52.893153] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 45600000 bytes from brick:1
[2017-05-12 08:47:52.907866] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46000000 bytes from brick:1
[2017-05-12 08:47:52.924883] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46400000 bytes from brick:1
[2017-05-12 08:47:52.940711] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46800000 bytes from brick:1
[2017-05-12 08:47:52.957370] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 47200000 bytes from brick:1
[2017-05-12 08:47:52.971701] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 47600000 bytes from brick:1
[2017-05-12 08:47:52.993772] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48000000 bytes from brick:1
[2017-05-12 08:47:53.007742] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48400000 bytes from brick:1
[2017-05-12 08:47:53.025668] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48800000 bytes from brick:1
[2017-05-12 08:47:53.038570] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 49200000 bytes from brick:1
[2017-05-12 08:47:53.080572] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 49600000 bytes from brick:1

fstat fields:
in_fstat.st_dev: 47
in_fstat.st_ino: 12217325060665640715
in_fstat.st_mode: 33261
in_fstat.st_nlink: 1
in_fstat.st_uid: 0
in_fstat.st_gid: 0
in_fstat.st_rdev: 0
in_fstat.st_size: 44800000
in_fstat.st_blksize: 131072
in_fstat.st_blocks: 23808
in_fstat.st_atime: 1494578873
in_fstat.st_mtime: 1494578873
in_fstat.st_ctime: 1494578873

2017-05-12 10:47:53.083996: fstat returned a size of 44800000, bytes_written: 49600000

From this I would guess that the problem is elsewhere. 1-2 days ago I checked my own logs that I added with the diff I put up here a while ago, and it seemed like this issue is caused when an invalidate call arrives at md-cache, but I haven't yet had time to look into it in more depth, as I'm trying to test other things in the meantime (it might have its roots in afr as well; it is just that I only had logs from md-cache for invalidate). I'm pasting part of my own logs from another reproduction, just in case it might help figure out what goes wrong.
2017-05-12 12:23:58.122391: mdc_invalidate
2017-05-12 12:23:58.122450: iatt_set_validate
2017-05-12 12:23:58.122464: mdc fields:
md_nlink: 1
md_uid: 0
md_gid: 0
md_rdev: 0
md_size: 28800000
md_blocks: 56250
md_atime: 1494584636
md_mtime: 1494584638
md_ctime: 1494584638
2017-05-12 12:23:58.122473: prebuf:
2017-05-12 12:23:58.122480: mdc_from_iatt
2017-05-12 12:23:58.122491: iatt fields:
ia_dev: 2050
ia_ino: 11843813978708742420
ia_nlink: 1
ia_uid: 0
ia_gid: 0
ia_rdev: 0
ia_size: 0
ia_blksize: 4096
ia_blocks: 0
ia_atime: 1494584638
ia_mtime: 1494584638
ia_ctime: 1494584638
2017-05-12 12:23:58.122510: mdc_fstat
2017-05-12 12:23:58.122522: after_local_check
2017-05-12 12:23:58.122544: mdc_to_iatt
2017-05-12 12:23:58.122556: mdc fields:
md_nlink: 1
md_uid: 0
md_gid: 0
md_rdev: 0
md_size: 0
md_blocks: 0
md_atime: 1494584638
md_mtime: 1494584638
md_ctime: 1494584638

fstat fields:
in_fstat.st_dev: 47
in_fstat.st_ino: 11843813978708742420
in_fstat.st_mode: 33261
in_fstat.st_nlink: 1
in_fstat.st_uid: 0
in_fstat.st_gid: 0
in_fstat.st_rdev: 0
in_fstat.st_size: 0
in_fstat.st_blksize: 131072
in_fstat.st_blocks: 0
in_fstat.st_atime: 1494584638
in_fstat.st_mtime: 1494584638
in_fstat.st_ctime: 1494584638

2017-05-12 12:23:58.122649: fstat returned a size of 0, bytes_written: 28800000
Hello, I've been adding more logging and trying to track down the problem. It seems the invalidate calls are also coming from the arbiter. The following calls: writev, lookup, setattr, stat sometimes trigger an invalidate notification to be sent from the upcall translator (a call to upcall_client_cache_invalidate).

I tried adding a check to afr_notify, before calling default_notify, to filter out messages sent from an arbiter, but it did not seem to work (reproduction was still possible afterwards); I might have done it wrong though:

if (propagate && !(event == GF_EVENT_UPCALL && AFR_IS_ARBITER_BRICK(priv, idx)))
        ret = default_notify (this, event, data);

Even so, I'm pretty sure it is the arbiter causing the problems, because I print the file the operation is being executed on in the posix_fdstat, posix_istat, and posix_pstat functions in posix-helpers.c right before the invalidation message gets sent.
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/