Bug 1444892 - When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.
Summary: When either killing or restarting a brick with performance.stat-prefetch on, stat sometimes returns a bad st_size value.
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: GlusterFS
Classification: Community
Component: replicate
Version: 3.10
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: medium
Target Milestone: ---
Assignee: Ravishankar N
QA Contact:
URL:
Whiteboard:
Depends On: 1449329
Blocks: glusterfs-3.10.2 1449593
 
Reported: 2017-04-24 13:45 UTC by miklos.fokin
Modified: 2017-05-31 20:46 UTC
CC List: 5 users

Fixed In Version: glusterfs-3.10.2
Doc Type: If docs needed, set a value
Doc Text:
Clone Of:
Clones: 1449329 1449593
Environment:
Last Closed: 2017-05-31 20:46:29 UTC
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Embargoed:


Attachments
Compressed file with main.cpp to generate the application, compile_run.sh to compile and run it, test_fstat.sh to kill and restart one of the brick processes. (1.90 KB, application/x-gzip)
2017-04-24 13:45 UTC, miklos.fokin
Diff with debug logging and patch to fix the problem (17.16 KB, patch)
2017-05-05 08:11 UTC, miklos.fokin

Description miklos.fokin 2017-04-24 13:45:04 UTC
Created attachment 1273617 [details]
Compressed file with main.cpp to generate the application, compile_run.sh to compile and run it, test_fstat.sh to kill and restart one of the brick processes.

Description of problem:
I have an application which counts how many bytes have been written to a file (the value returned by pwrite gets added to the sum on each iteration).
After the application writes the file, it does an fdatasync and then calls fstat.
I kill one of the brick processes (brick1) and restart it after a few seconds, then give it enough time to heal (checking with volume heal info and grepping for a number of entries greater than 0).

The fstat sometimes returns a bad st_size, even though it doesn't report any error.
If this happens after a restart, the returned size is simply a bit less than what it should be (around 100 megabytes).
If this happens after killing the brick process, the size is 0 (this happened once with a 900-megabyte file, but otherwise always with files shorter than 100 megabytes).
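
For illustration, here is a minimal C sketch of the check the application performs (a sketch only: the mount path is a placeholder, the write size and loop count are taken from the description below, and the actual reproducer is main.cpp in the attachment):

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
    /* 400 KB per write and 150 iterations, as in the reproducer description;
       the mount path is a placeholder. */
    static char buf[400 * 1024];
    const char *path = "/mnt/glustervol/testfile";
    off_t written = 0;
    struct stat st;
    int fd, i;

    memset(buf, 'x', sizeof(buf));
    fd = open(path, O_CREAT | O_TRUNC | O_WRONLY, 0644);
    if (fd < 0) { perror("open"); return 1; }

    for (i = 0; i < 150; i++) {
        ssize_t ret = pwrite(fd, buf, sizeof(buf), written);
        if (ret < 0) { perror("pwrite"); return 1; }
        written += ret;                 /* sum what pwrite reports */
    }

    if (fdatasync(fd) != 0) { perror("fdatasync"); return 1; }
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    if (st.st_size != written) {        /* the bug: 0 or a too-small size */
        fprintf(stderr, "bad st_size: %lld, expected %lld\n",
                (long long) st.st_size, (long long) written);
        return 2;
    }

    close(fd);
    return 0;
}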


Version-Release number of selected component (if applicable):
3.10.0


How reproducible:
It happens rarely (once in about 150-300 tries), but with the attached application and script it takes 5-15 minutes to reproduce on my own computer.


Steps to Reproduce:
1. Create a replica 3 volume in which 1 brick is an arbiter.
2. Run the application (a.out from main.cpp in my scripts, can be automated with compile_run.sh) to create and write a file many times on the volume.
3. Run the script (test_fstat.sh) to continuously kill and restart the first brick.
4. After a while the application should detect a bad fstat st_size and exit, printing the st_size value; the script will report whether it detected the application exiting after it killed a brick or after it restarted one.


Actual results:
Sometimes fstat returns an st_size that is either 0 or less than what the file size should be.


Expected results:
fstat should always return the correct file size, or produce an error.


Additional info:
With performance.stat-prefetch turned off I ran the application and the script for about 45 minutes, but had no luck reproducing the bug.
It usually took 5-15 minutes to get an error with the option turned on.
The results also seem to depend on the number of iterations/writes in the application (and thus probably on the size of the file being written).
When the number in the loop's condition (the number of 400KB writes to a file) is 150, the problem happens both after killing and after restarting.
When the number was higher (200000), I only got bad results after restarts, except once out of around 10 times.

- FUSE mount configuration:
-o direct-io-mode=on passed explicitly to mount

- volume configuration:
nfs.disable: on
transport.address-family: inet
cluster.consistent-metadata: yes
cluster.eager-lock: on
cluster.readdir-optimize: on
cluster.self-heal-readdir-size: 64KB
cluster.self-heal-daemon: on
cluster.read-hash-mode: 2
cluster.use-compound-fops: on
cluster.ensure-durability: on
cluster.entry-self-heal: off
cluster.data-self-heal: off
cluster.metadata-self-heal: off
performance.quick-read: off
performance.md-cache-timeout: 600
performance.cache-invalidation: on
performance.stat-prefetch: off
features.cache-invalidation-timeout: 600
features.cache-invalidation: on
performance.flush-behind: off
performance.write-behind: off
performance.open-behind: off
cluster.background-self-heal-count: 1
network.inode-lru-limit: 1024
network.ping-timeout: 1
performance.io-cache: off
cluster.locking-scheme: granular
cluster.granular-entry-heal: enable

Comment 1 miklos.fokin 2017-04-26 11:04:34 UTC
Yesterday I got a suggestion about a review that might fix the problem, with the changes being in performance.readdir-ahead (https://review.gluster.org/#/c/16419/), but I managed to reproduce the issue with readdir-ahead off (and stat-prefetch on) as well.

Comment 2 miklos.fokin 2017-05-04 13:02:52 UTC
Today I found the part of the code that seems to cause the faulty 0 size from fstat after a kill.
Although it is only indirectly connected to stat-prefetch, the issue seems to be what I describe in the following paragraph: with the fix I suggest I did not get a reproduction in 30-35 minutes, after which I hit the other problem, where the size is just a bit lower than it should be.

In afr_fsync_cbk, when receiving replies from the bricks, there are two points where the fstat data can get updated: the initial update, and the update from the brick we selected to get the final data from.
During debugging I found that if the initial update comes from the arbiter, the size will be 0.
If the brick we selected to get the final data from is down, we get a struct filled with zeroes and an error value, so we don't get a second update.
The change with which I no longer got a 0 size was simply to wait for a non-arbiter brick to provide the first update.
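
To illustrate the pattern described above, here is a simplified, hypothetical sketch of the update logic; the type and function names are stand-ins for illustration only, not the actual AFR code (the real function is in the attached diff and the GlusterFS source):

#include <stdint.h>

struct reply_iatt {
    uint64_t ia_size;            /* stand-in for GlusterFS's struct iatt */
};

struct fsync_state {
    struct reply_iatt postbuf;   /* iatt that will eventually be unwound */
    int have_postbuf;            /* has the initial update happened? */
    int read_subvol;             /* child chosen to provide the final data */
};

/* Called once per successful brick reply; a brick that is down returns an
   error and never reaches this point. */
void on_fsync_reply(struct fsync_state *st, int child,
                    const struct reply_iatt *postbuf)
{
    if (!st->have_postbuf) {
        /* Initial update: the first successful reply seeds the iatt.
           If that reply came from the arbiter, ia_size is 0. */
        st->postbuf = *postbuf;
        st->have_postbuf = 1;
    }

    if (child == st->read_subvol) {
        /* Second update: the selected brick's reply overrides the seed.
           If the selected brick is down, this override never happens and
           the arbiter's zero size is what gets returned to the caller. */
        st->postbuf = *postbuf;
    }
}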

Comment 3 Pranith Kumar K 2017-05-05 05:21:59 UTC
As per the mail and patch on the gluster-devel mailing list, it seems to be a replicate issue. Changing the component and assignee.

Comment 4 Ravishankar N 2017-05-05 07:29:24 UTC
(In reply to miklos.fokin from comment #2)

> In afr_fsync_cbk, when receiving data from the bricks there are two times
> when the fstat data can get updated: first is the initial update, and then
> the one from the brick we selected to get the final data from.
> During the debugging I found that if the initial update is coming from the
> arbiter the size will be 0.
> If the brick we selected to get the final data from is down, we get a struct
> filled with zeroes, and an error value, thus we don't get a second update.

Hi Miklos, did you  observe while debugging that in afr_fsync_cbk(), the 'read_subvol' is indeed the brick you brought down? I haven't tried your test yet but if a brick is brought down, then the read_subvol is updated on the next lookup/inode_refresh where the brick is not marked readable any more until heal is complete. So, if you brought down brick1, then the call to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

Comment 5 miklos.fokin 2017-05-05 08:11:01 UTC
Created attachment 1276504 [details]
Diff with debug logging and patch to fix the problem

Comment 6 miklos.fokin 2017-05-05 08:13:45 UTC
(In reply to Ravishankar N from comment #4)
> (In reply to miklos.fokin from comment #2)
> 
> > In afr_fsync_cbk, when receiving data from the bricks there are two times
> > when the fstat data can get updated: first is the initial update, and then
> > the one from the brick we selected to get the final data from.
> > During the debugging I found that if the initial update is coming from the
> > arbiter the size will be 0.
> > If the brick we selected to get the final data from is down, we get a struct
> > filled with zeroes, and an error value, thus we don't get a second update.
> 
> Hi Miklos, did you  observe while debugging that in afr_fsync_cbk(), the
> 'read_subvol' is indeed the brick you brought down? I haven't tried your
> test yet but if a brick is brought down, then the read_subvol is updated on
> the next lookup/inode_refresh where the brick is not marked readable any
> more until heal is complete. So, if you brought down brick1, then the call
> to afr_data_subvol_get() in afr_fsync_cbk() should give you brick2.

Hello Ravishankar, what I did was put some logging into afr_fsync_cbk() to see which code paths get taken (after doing the same with other functions and finally ending up there while debugging).
I also added logging to print out the post_buff in each call to the function.

The first log was in the "if (op_ret == 0)" branch.
Another one was in "if (local->op_ret == -1)".
Another was in "if (child_index == read_subvol)".
I am also attaching the complete diff with the logs and the fix to show you the code; perhaps it is easier to get it from there.

When things went wrong I received 3 calls:
  1 with a mostly correct structure, apart from the number of blocks and the size of the file, which were zero; here I got a log from the first and second places
  1 with a completely correct structure, but here I only got a log from the first place
  1 with a completely zero structure and no logs from any of the branches

I am writing this from memory, since I moved on and didn't save the logs, but I can reproduce it and send you the file in case you need it.
These findings led me to believe that, since the third branch never got taken, the third reply was the one selected, but because it returned an error it never got processed.

Comment 7 miklos.fokin 2017-05-09 07:44:48 UTC
In my last comment I forgot to mention that for debugging I had to set the mount's attribute-timeout to 0, so that I could get the logs that I wanted.
Thought I would mention that as it might be helpful to someone wanting to do the same.

Comment 8 Ravishankar N 2017-05-09 08:09:28 UTC
Hi Miklos, I was able to re-create the issue with the test case you gave; I got a zero st_size within a couple of minutes of running it. I will be sending the patch out soon. The fix is indeed in afr_fsync_cbk, with some modifications that should also handle the non-zero but incorrect st_size errors that you observed.

Comment 9 miklos.fokin 2017-05-09 08:23:25 UTC
Hey, great news, thanks!

Comment 10 Worker Ant 2017-05-11 08:23:19 UTC
REVIEW: https://review.gluster.org/17247 (afr: send the correct iatt values in fsync cbk) posted (#1) for review on release-3.10 by Ravishankar N (ravishankar)

Comment 11 Worker Ant 2017-05-14 08:27:38 UTC
COMMIT: https://review.gluster.org/17247 committed in release-3.10 by Raghavendra Talur (rtalur) 
------
commit 76fa6f9e21358db4c6a258afaf48f123b66ab343
Author: Ravishankar N <ravishankar>
Date:   Tue May 9 02:31:39 2017 +0530

    afr: send the correct iatt values in fsync cbk
    
    Problem:
    afr unwinds the fsync fop with an iatt buffer from one of its children
    on whom fsync was successful. But that child might not be a valid read
    subvolume for that inode because of pending heals or because it happens
    to be the arbiter brick etc. Thus we end up sending the wrong iatt to
    mdcache which will in turn serve it to the application on a subsequent
    stat call as reported in the BZ.
    
    Fix:
    Pick a child on whom the fsync was successful *and* that is readable as
    indicated in the inode context.
    
    > Reviewed-on: https://review.gluster.org/17227
    > CentOS-regression: Gluster Build System <jenkins.org>
    > Reviewed-by: Pranith Kumar Karampuri <pkarampu>
    > NetBSD-regression: NetBSD Build System <jenkins.org>
    > Smoke: Gluster Build System <jenkins.org>
    (cherry picked from commit 1a8fa910ccba7aa941f673302c1ddbd7bd818e39)
    
    Change-Id: Ie8647289219cebe02dde4727e19a729b3353ebcf
    BUG: 1444892
    RCA'ed-by: Miklós Fokin <miklos.fokin>
    Signed-off-by: Ravishankar N <ravishankar>
    Reviewed-on: https://review.gluster.org/17247
    Smoke: Gluster Build System <jenkins.org>
    NetBSD-regression: NetBSD Build System <jenkins.org>
    Reviewed-by: Raghavendra Talur <rtalur>
    CentOS-regression: Gluster Build System <jenkins.org>
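
To make the fix description above concrete, the selection rule amounts to something like the following hypothetical sketch (names and types are illustrative, not the actual AFR code; the real implementation is in the patch at the review link above):

#include <stdint.h>

struct reply {
    int      success;        /* fsync returned success on this child */
    uint64_t ia_size;        /* stand-in for the child's struct iatt */
};

/* readable[i] is non-zero if child i is a valid read subvolume for the inode
   (an arbiter brick or a brick with pending heals is not). Returns the index
   of the child whose iatt should be unwound, or -1 if none qualifies; how the
   real patch handles that corner case can be seen in the review itself. */
int pick_fsync_reply(const struct reply *replies, const int *readable,
                     int nchildren)
{
    int i;

    for (i = 0; i < nchildren; i++) {
        if (replies[i].success && readable[i])
            return i;        /* successful *and* readable, as the fix requires */
    }
    return -1;
}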

Comment 12 Ravishankar N 2017-05-15 09:24:54 UTC
Hi Raghavendra, Miklos is still seeing a part of the issue. We continued the discussion on the BZ for the master branch: https://bugzilla.redhat.com/show_bug.cgi?id=1449329#c4

Moving the bug back to POST for now.

Comment 13 Raghavendra Talur 2017-05-31 20:46:29 UTC
This bug is getting closed because a release has been made available that should address the reported issue. In case the problem is still not fixed with glusterfs-3.10.2, please open a new bug report.

