Bug 1449329 - When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.
Summary: When either killing or restarting a brick with performance.stat-prefetch on, ...
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: mainline
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On:
Blocks: 1444892 1449924 1449941
 
Reported: 2017-05-09 16:08 UTC by Ravishankar N
Modified: 2017-09-05 17:29 UTC (History)

Fixed In Version: glusterfs-3.12.0
Doc Type: If docs needed, set a value
Doc Text:
Clone Of: 1444892
Clones: 1449924 1449941
Environment:
Last Closed: 2017-09-05 17:29:38 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:



Description Ravishankar N 2017-05-09 16:08:54 UTC
+++ This bug was initially created as a clone of Bug #1444892 +++

Description of problem:
I have an application which counts how many bytes have been written to a file (the value returned by pwrite gets added to the sum on each iteration).
After the application writes the file it does an fdatasync, and then calls fstat.
I kill one of the brick processes (brick1) and restart it after a few seconds, then give it enough time to heal (checking with volume heal info and grepping for a non-zero number of entries).

The fstat sometimes returns a bad st_size, even though it doesn't give any errors.
If this happens after a restart, it simply returns a size that is a bit less than what it should be (around 100 megabytes).
If this happens after killing the brick process, the size is 0 (this happened once with a 900 megabyte file, but otherwise always with files shorter than 100 megabytes).
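
For reference, a minimal standalone sketch of the check the application performs (this is only an illustration of the description above, not the attached main.cpp; the mount path and sizes are made up):
--------------------------
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        static char buf[400000];     /* one ~400KB write per iteration */
        struct stat st;
        off_t written = 0;
        int fd, i;

        memset(buf, 'x', sizeof(buf));
        fd = open("/mnt/volume1/testfile", O_CREAT | O_TRUNC | O_WRONLY, 0644);
        if (fd < 0) { perror("open"); return 1; }

        for (i = 0; i < 150; i++) {  /* 150 writes, as in the short-file case */
                ssize_t ret = pwrite(fd, buf, sizeof(buf), written);
                if (ret < 0) { perror("pwrite"); return 1; }
                written += ret;      /* sum only what pwrite reports */
        }

        if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
        if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

        if (st.st_size != written) {
                fprintf(stderr, "bad st_size: %lld, bytes_written: %lld\n",
                        (long long)st.st_size, (long long)written);
                exit(2);
        }

        close(fd);
        return 0;
}
--------------------------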


Version-Release number of selected component (if applicable):
3.10.0


How reproducible:
It happens rarely (once in about 150-300 tries), but with the attached application and script it takes between 5-15 minutes to reproduce on my own computer.


Steps to Reproduce:
1. Create a volume with replica 3, of which 1 brick is an arbiter.
2. Run the application (a.out from main.cpp in my scripts, can be automated with compile_run.sh) to create and write a file many times on the volume.
3. Run the script (test_fstat.sh) to continuously kill and restart the first brick.
4. After a while the application should detect a bad fstat st_size and exit, printing the st_size value; the script will then report whether it detected the application exiting after it killed the brick or after restarting it.


Actual results:
Sometimes fstat returns an st_size of either 0, or less than what the file size should be.


Expected results:
fstat should always return the correct file size, or produce an error.


Additional info:
With performance.stat-prefetch turned off I ran the application and the script for about 45 minutes, but had no luck reproducing the bug.
It usually took 5-15 minutes to get an error with the option turned on.
Also the results seem to depend on the number of iterations/writes (thus probably the size of the file being written) in the application.
When the number in the loop's condition (number of 400KB writes on a file) is 150, the problem happens both after killing and restarting.
When the number was higher (200000) I only got bad results after restarts, except once out of around 10 times.

- FUSE mount configuration:
-o direct-io-mode=on passed explicitly to mount

- volume configuration:
nfs.disable: on
transport.address-family: inet
cluster.consistent-metadata: yes
cluster.eager-lock: on
cluster.readdir-optimize: on
cluster.self-heal-readdir-size: 64KB
cluster.self-heal-daemon: on
cluster.read-hash-mode: 2
cluster.use-compound-fops: on
cluster.ensure-durability: on
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.quick-read: off
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.flush-behind: off
performance.write-behind: off
performance.open-behind: off
cluster.background-self-heal-count: 1
network.inode-lru-limit: 1024
network.ping-timeout: 1
performance.io-cache: off
cluster.locking-scheme: granular
cluster.granular-entry-heal: enable

--- Additional comment from  on 2017-04-26 07:04:34 EDT ---

Yesterday I got a suggestion about a review that might fix the problem, with the changes being in performance.readdir-ahead (https://review.gluster.org/#/c/16419/), but I managed to reproduce the issue with readdir-ahead turned off (and stat-prefetch turned on) as well.

--- Additional comment from  on 2017-05-04 09:02:52 EDT ---

Today I found a part of the code that seems to be the problem when the faulty 0 size from fstat shows up after a kill.
It only has an indirect connection to stat-prefetch, but the issue seems to be what I describe in the following paragraph: with the fix I suggest I did not get a reproduction of the 0 size in 30-35 minutes, after which I hit the other problem, where the size is just a bit lower than what it should be.

In afr_fsync_cbk, when receiving data from the bricks there are two times when the fstat data can get updated: first is the initial update, and then the one from the brick we selected to get the final data from.
During the debugging I found that if the initial update is coming from the arbiter the size will be 0.
If the brick we selected to get the final data from is down, we get a struct filled with zeroes, and an error value, thus we don't get a second update.
The change with which I didn't get a 0 size was simply waiting for a non-arbiter brick to be the first update.
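
To illustrate the sequence described above, here is a small standalone simulation (the struct, reply ordering and sizes are made up for the example; this is not the actual afr code or data): the first successful reply is taken as a provisional iatt, only the reply from the chosen read subvolume overwrites it, and an errored reply from that subvolume leaves the arbiter's zero size in place.
--------------------------
#include <stdio.h>

struct child_reply { int replied; int op_ret; long size; };

int main(void)
{
        struct child_reply replies[3] = {
                { 0, -1, 0 },          /* brick0: killed, reply is an error      */
                { 1,  0, 44800000 },   /* brick1: healthy data brick             */
                { 1,  0, 0 },          /* brick2: arbiter, always reports size 0 */
        };
        int read_subvol = 0;           /* chosen subvolume happens to be brick0  */
        long unwound_size = -1;
        int got_first = 0;
        int order[3] = { 2, 1, 0 };    /* arbiter answers first in the bad case  */
        int i;

        for (i = 0; i < 3; i++) {
                struct child_reply *r = &replies[order[i]];
                if (!r->replied || r->op_ret != 0)
                        continue;               /* errored replies never update   */
                if (!got_first) {               /* first update: provisional iatt */
                        unwound_size = r->size;
                        got_first = 1;
                }
                if (order[i] == read_subvol)    /* second update: read subvolume  */
                        unwound_size = r->size;
        }

        /* Prints 0: the arbiter's size was never overwritten. */
        printf("size unwound to md-cache: %ld\n", unwound_size);
        return 0;
}
--------------------------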

--- Additional comment from Pranith Kumar K on 2017-05-05 01:21:59 EDT ---

As per the mail and patch on the gluster-devel mailing list, it seems to be a replicate issue. Changing the component and assignee.

--- Additional comment from Ravishankar N on 2017-05-05 03:29:24 EDT ---

(In reply to miklos.fokin from comment #2)

> In afr_fsync_cbk, when receiving data from the bricks there are two times
> when the fstat data can get updated: first is the initial update, and then
> the one from the brick we selected to get the final data from.
> During the debugging I found that if the initial update is coming from the
> arbiter the size will be 0.
> If the brick we selected to get the final data from is down, we get a struct
> filled with zeroes, and an error value, thus we don't get a second update.

Hi Miklos, did you  observe while debugging that in afr_fsync_cbk(), the 'read_subvol' is indeed the brick you brought down? I haven't tried your test yet but if a brick is brought down, then the read_subvol is updated on the next lookup/inode_refresh where the brick is not marked readable any more until heal is complete. So, if you brought down brick1, then the call to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

--- Additional comment from  on 2017-05-05 04:11 EDT ---



--- Additional comment from  on 2017-05-05 04:13:45 EDT ---

(In reply to Ravishankar N from comment #4)
> (In reply to miklos.fokin from comment #2)
> 
> > In afr_fsync_cbk, when receiving data from the bricks there are two times
> > when the fstat data can get updated: first is the initial update, and then
> > the one from the brick we selected to get the final data from.
> > During the debugging I found that if the initial update is coming from the
> > arbiter the size will be 0.
> > If the brick we selected to get the final data from is down, we get a struct
> > filled with zeroes, and an error value, thus we don't get a second update.
> 
> Hi Miklos, did you  observe while debugging that in afr_fsync_cbk(), the
> 'read_subvol' is indeed the brick you brought down? I haven't tried your
> test yet but if a brick is brought down, then the read_subvol is updated on
> the next lookup/inode_refresh where the brick is not marked readable any
> more until heal is complete. So, if you brought down brick1, then the call
> to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

Hello Ravishankar, what I did was put some logging into afr_fsync_cbk() to see which code paths get taken (after doing the same with other functions and finally ending up there while debugging).
I also added logging to print out the postbuf in each call to the function.

The first log was in the "if (op_ret == 0)" branch.
Another one was in the "if (local->op_ret == -1)" branch.
Another was in the "if (child_index == read_subvol)" branch.
I am also attaching the complete diff with the logs and the fix to show you the code, perhaps it is easier to get it from there.

When things went wrong I received 3 calls:
  1 with a mostly correct structure, apart from the number of blocks and the size of the file, which were zero; here I got a log from the first and second places
  1 with a completely correct structure, but here I only got a log from the first place
  1 with a completely zero structure and no logs from any of the branches

I am writing this from memory, since I moved on and didn't save the logs, but I can reproduce it and send you the file in case you need it.
These findings led me to believe that, since the third branch never got taken, the third reply was the one that was selected as the read subvolume, but since it returned an error it never got processed.

--- Additional comment from  on 2017-05-09 03:44:48 EDT ---

In my last comment I forgot to mention that for debugging I had to set the mount's attribute-timeout to 0, so that I could get the logs that I wanted.
Thought I would mention that as it might be helpful to someone wanting to do the same.

--- Additional comment from Ravishankar N on 2017-05-09 04:09:28 EDT ---

Hi Miklos, I was able to re-create the issue with the test case you gave; I got a zero st_size within a couple of minutes of running it. I will be sending the patch out soon. The fix is indeed in afr_fsync_cbk, with some modifications that should handle even the non-zero but incorrect st_size errors that you observed.

--- Additional comment from  on 2017-05-09 04:23:25 EDT ---

Hey, great news, thanks!

Comment 1 Worker Ant 2017-05-09 16:15:41 UTC
REVIEW: https://review.gluster.org/17227 (afr: send the correct iatt values in fsync cbk) posted (#1) for review on master by Ravishankar N (ravishankar)

Comment 2 Ravishankar N 2017-05-09 16:18:36 UTC
Miklos, you might want to give the patch a test. Or you could wait until it gets reviewed first. I wasn't able to hit the issue any more on 3.10 with the patch.

Comment 3 Worker Ant 2017-05-11 06:34:50 UTC
COMMIT: https://review.gluster.org/17227 committed in master by Pranith Kumar Karampuri (pkarampu) 
------
commit 1a8fa910ccba7aa941f673302c1ddbd7bd818e39
Author: Ravishankar N <ravishankar>
Date:   Tue May 9 02:31:39 2017 +0530

    afr: send the correct iatt values in fsync cbk
    
    Problem:
    afr unwinds the fsync fop with an iatt buffer from one of its children
    on whom fsync was successful. But that child might not be a valid read
    subvolume for that inode because of pending heals or because it happens
    to be the arbiter brick etc. Thus we end up sending the wrong iatt to
    mdcache which will in turn serve it to the application on a subsequent
    stat call as reported in the BZ.
    
    Fix:
    Pick a child on whom the fsync was successful *and* that is readable as
    indicated in the inode context.
    
    Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf
    BUG: 1449329
    RCA'ed-by: Miklós Fokin <miklos.fokin>
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: https://review.gluster.org/17227
    CentOS-regression: Gluster Build System <jenkins.org>
    Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Smoke: Gluster Build System <jenkins.org>
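
For illustration only (this is not the code from the review above; the struct and function names are made up), a minimal standalone sketch of the rule stated in the commit message: pick a child on whom fsync succeeded and that is readable, falling back to any successful child otherwise.
--------------------------
#include <stdbool.h>
#include <stdio.h>

struct reply {
        bool valid;     /* did this child reply at all            */
        int  op_ret;    /* 0 on success, -1 on failure            */
        long ia_size;   /* size carried in the reply's post-iatt  */
};

/* Return the index of a child whose fsync succeeded and that is marked
   readable; fall back to any successful child if none is readable. */
static int
pick_fsync_reply (const struct reply *replies, const bool *readable, int children)
{
        int i, fallback = -1;

        for (i = 0; i < children; i++) {
                if (!replies[i].valid || replies[i].op_ret != 0)
                        continue;
                if (readable[i])
                        return i;
                if (fallback == -1)
                        fallback = i;
        }
        return fallback;  /* -1 if every child failed */
}

int
main (void)
{
        /* brick0: killed (no reply), brick1: healthy data brick,
           brick2: arbiter (reports size 0). Only brick1 is readable. */
        struct reply replies[3] = {
                { .valid = false },
                { .valid = true, .op_ret = 0, .ia_size = 44800000 },
                { .valid = true, .op_ret = 0, .ia_size = 0 },
        };
        bool readable[3] = { false, true, false };

        int src = pick_fsync_reply (replies, readable, 3);
        if (src >= 0)
                printf ("unwind fsync with the iatt from child %d (size %ld)\n",
                        src, replies[src].ia_size);
        return 0;
}
--------------------------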

Comment 4 miklos.fokin 2017-05-11 08:38:40 UTC
(In reply to Ravishankar N from comment #2)
> Miklos, you might want to give the patch a test. Or you could wait until it
> gets reviewed first. I wasn't able to hit the issue any more on 3.10 with
> the patch.

Hello, sorry for the late response.
I tried the test with the patch, the results seem to be the same for me as with the original arbiter check that I implemented.
The faulty behavior when killing the brick stopped, so only the problem when restarting the brick remains, when it returns a bit less than what it should.
The reproduction time varied widely though, I tried it 3 times and 2 times the reproduction took around 5-10 minutes, while the last one was around 30.

Comment 5 Ravishankar N 2017-05-11 08:52:33 UTC
(In reply to miklos.fokin from comment #4)
> (In reply to Ravishankar N from comment #2)
> > Miklos, you might want to give the patch a test. Or you could wait until it
> > gets reviewed first. I wasn't able to hit the issue any more on 3.10 with
> > the patch.
> 
> Hello, sorry for the late response.
> I tried the test with the patch, the results seem to be the same for me as
> with the original arbiter check that I implemented.
> The faulty behavior when killing the brick stopped, so only the problem when
> restarting the brick remains, when it returns a bit less than what it should.
> The reproduction time varied widely though, I tried it 3 times and 2 times
> the reproduction took around 5-10 minutes, while the last one was around 30.

Hi Miklos, the fix was supposed to solve even the lesser stat size problem. I wasn't able to hit it with the fix on a 30-minute run when I tested the patch, but if you are able to hit it 3/3 times, it looks like there is something more to be fixed.

Comment 6 Ravishankar N 2017-05-11 13:06:42 UTC
So I ran it for 90-plus minutes but could not hit the issue.

Can you add the following log to see if afr_fsync_cbk is indeed the one that is sending a short inode size when you run your test?
--------------------------
diff --git a/xlators/cluster/afr/src/afr-common.c b/xlators/cluster/afr/src/afr-common.c
index 34d9e56..cd03056 100644
--- a/xlators/cluster/afr/src/afr-common.c
+++ b/xlators/cluster/afr/src/afr-common.c
@@ -3371,6 +3371,7 @@ afr_fsync_cbk (call_frame_t *frame, void *cookie, xlator_t *this,
                         if (i == read_subvol)
                                 break;
                 }
+                gf_log (this->name, GF_LOG_INFO, "Unwinding with a size of %ld bytes from brick:%d", local->cont.inode_wfop.postbuf.ia_size, read_subvol);

                /* Make a stub out of the frame, and register it
                   with the waking up post-op. When the call-stub resumes,

--------------------------

Comment 7 miklos.fokin 2017-05-12 10:34:00 UTC
Hello, sorry for the late response this time as well.
I put the logging in place and tried it out.
I am pasting the client logs together with the reproduction output.

[2017-05-12 08:47:52.845882] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 44400000 bytes from brick:1
[2017-05-12 08:47:52.859213] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 44800000 bytes from brick:1
[2017-05-12 08:47:52.876972] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 45200000 bytes from brick:1
[2017-05-12 08:47:52.893153] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 45600000 bytes from brick:1
[2017-05-12 08:47:52.907866] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46000000 bytes from brick:1
[2017-05-12 08:47:52.924883] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46400000 bytes from brick:1
[2017-05-12 08:47:52.940711] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 46800000 bytes from brick:1
[2017-05-12 08:47:52.957370] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 47200000 bytes from brick:1
[2017-05-12 08:47:52.971701] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 47600000 bytes from brick:1
[2017-05-12 08:47:52.993772] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48000000 bytes from brick:1
[2017-05-12 08:47:53.007742] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48400000 bytes from brick:1
[2017-05-12 08:47:53.025668] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 48800000 bytes from brick:1
[2017-05-12 08:47:53.038570] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 49200000 bytes from brick:1
[2017-05-12 08:47:53.080572] I [afr-common.c:3439:afr_fsync_cbk] 0-volume1-replicate-0: Unwinding with a size of 49600000 bytes from brick:1

fstat fields:
in_fstat.st_dev: 47
in_fstat.st_ino: 12217325060665640715
in_fstat.st_mode: 33261
in_fstat.st_nlink: 1
in_fstat.st_uid: 0
in_fstat.st_gid: 0
in_fstat.st_rdev: 0
in_fstat.st_size: 44800000
in_fstat.st_blksize: 131072
in_fstat.st_blocks: 23808
in_fstat.st_atime: 1494578873
in_fstat.st_mtime: 1494578873
in_fstat.st_ctime: 1494578873
2017-05-12 10:47:53.083996: fstat returned a size of 44800000, bytes_written: 49600000

From this I would guess that this problem lies elsewhere.
1-2 days ago I checked my own logs (added with the diff I put up here a while ago), and it seemed like this issue is caused when an invalidate call arrives at md-cache. I haven't yet had time to look at it in more depth, as I'm trying to test other things in the meantime (it might have its roots in afr as well; it is just that I only had logs from md-cache for the invalidate).
I'm pasting a part of my own logs from another reproduction just in case it might help figuring out what goes wrong.


2017-05-12 12:23:58.122391: mdc_invalidate
2017-05-12 12:23:58.122450: iatt_set_validate
2017-05-12 12:23:58.122464: mdc fields:
md_nlink: 1
md_uid: 0
md_gid: 0
md_rdev: 0
md_size: 28800000
md_blocks: 56250
md_atime: 1494584636
md_mtime: 1494584638
md_ctime 1494584638

2017-05-12 12:23:58.122473: prebuf:
2017-05-12 12:23:58.122480: mdc_from_iatt
2017-05-12 12:23:58.122491: iatt fields:
ia_dev: 2050
ia_ino: 11843813978708742420
ia_nlink: 1
ia_uid: 0
ia_gid: 0
ia_rdev: 0
ia_size: 0
ia_blksize: 4096
ia_blocks: 0
ia_atime: 1494584638
ia_mtime: 1494584638
ia_ctime 1494584638

2017-05-12 12:23:58.122510: mdc_fstat
2017-05-12 12:23:58.122522: after_local_check
2017-05-12 12:23:58.122544: mdc_to_iatt
2017-05-12 12:23:58.122556: mdc fields:
md_nlink: 1
md_uid: 0
md_gid: 0
md_rdev: 0
md_size: 0
md_blocks: 0
md_atime: 1494584638
md_mtime: 1494584638
md_ctime 1494584638



fstat fields:
in_fstat.st_dev: 47
in_fstat.st_ino: 11843813978708742420
in_fstat.st_mode: 33261
in_fstat.st_nlink: 1
in_fstat.st_uid: 0
in_fstat.st_gid: 0
in_fstat.st_rdev: 0
in_fstat.st_size: 0
in_fstat.st_blksize: 131072
in_fstat.st_blocks: 0
in_fstat.st_atime: 1494584638
in_fstat.st_mtime: 1494584638
in_fstat.st_ctime: 1494584638
2017-05-12 12:23:58.122649: fstat returned a size of 0, bytes_written: 28800000

Comment 8 miklos.fokin 2017-05-22 15:10:15 UTC
Hello,
I've been adding some logging and trying to track down the problem.
It seems like the invalidate calls are also coming from the arbiter.
The following calls (writev, lookup, setattr, stat) sometimes trigger an invalidate notification to be sent from the upcall translator (a call to upcall_client_cache_invalidate).
I tried adding a check to afr_notify, before calling default_notify, to filter out messages sent from an arbiter, but it did not seem to work (reproduction was still possible afterwards); I might have done it wrong, though:
        if (propagate && !(event == GF_EVENT_UPCALL && AFR_IS_ARBITER_BRICK(priv, idx)))
                ret = default_notify (this, event, data);
Even so, I'm pretty sure that it is the arbiter causing the problems, because I print the file on which the operation is being executed in the posix_fdstat, posix_istat, and posix_pstat functions in posix-helpers.c, right before the invalidation message gets sent.

Comment 9 Shyamsundar 2017-09-05 17:29:38 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.12.0, please open a new bug report.

glusterfs-3.12.0 has been announced on the Gluster mailing lists [1]; packages for several distributions should become available in the near future. Keep an eye on the Gluster Users mailing list [2] and the update infrastructure for your distribution.

[1] http://lists.gluster.org/pipermail/announce/2017-September/000082.html
[2] https://www.gluster.org/pipermail/gluster-users/

