Bug 1773476

Summary: gluster does not return correct filesize and blocksize after ftruncate
Product: [Community] GlusterFS Reporter: zhou lin <zz.sh.cynthia>
Component: coreAssignee: bugs <bugs>
Status: CLOSED UPSTREAM QA Contact:
Severity: high Docs Contact:
Priority: unspecified    
Version: 7CC: bugs, pasik, ravishankar
Target Milestone: ---Keywords: Triaged
Target Release: ---   
Hardware: x86_64   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-03-12 12:16:55 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
execute this file with a glusterfs file as argument none

Description zhou lin 2019-11-18 08:53:40 UTC
Created attachment 1637237 [details]
execute this file with a glusterfs file as argument

Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

Comment 1 zhou lin 2019-11-18 08:55:11 UTC
During my recent test of glusterfs7 I find a very confusing issue.it is a permanent issue that with enclosed test.c test crash each time(not glusterfs but the test app).
Also, ftruncate the fstat does not return correct file size and block size and I think this is a bug of glusterfs7, when I use glusterfs 3.12 there is no such issue.

Test steps
Pre-codition: set volume cluster.consistent-metadata: on
Just execute test.c choose a glusterfs folder file to be the argument
Test.c logic:
1>	create a file with fopen
2>	fstat the file
3>	ftruncate the file
4>	mmap the file
5>	write to file - crash
# ./test /mnt/export/testrty
ID of containing device:  [0,2a]
File type:                regular file
I-node number:            -5612996300533715526
Mode:                     100644 (octal)
Link count:               1
Ownership:                UID=0   GID=615
Preferred I/O block size: 131072 bytes
File size:                0 bytes
Blocks allocated:         0
Last status change:       Mon Nov 18 09:53:40 2019
Last file access:         Mon Nov 18 09:53:40 2019
Last file modification:   Mon Nov 18 09:53:40 2019
ftruncate is called, size= 52

File block after truncate: 0
File size after truncate: 0
Bus error (core dumped)#   = crash happen in snprintf((char *)m_cells, sizeof(struct DataStorageCell),
            "persistent data %ui", (unsigned int)sizeof(struct DataStorageCell));


[New LWP 7198]
Core was generated by `./a.out /mnt/log/test'.
Program terminated with signal SIGBUS, Bus error.
#0  0x00007f66829c4c5d in vsnprintf () from /lib64/libc.so.6
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.28-40.wf30.x86_64
(gdb) bt
#0  0x00007f66829c4c5d in vsnprintf () from /lib64/libc.so.6
#1  0x00007f66829a4263 in snprintf () from /lib64/libc.so.6
#2  0x00000000004015f6 in main ()
(gdb) quit
[root@mn-0:/var/lib/systemd/coredump]
# cd /home/robot
[root@mn-0:/home/robot]



When I set cluster.consistent-metadata to “off”, this crash issue disappear, and the filesize and blocks is correct!!
[root@mn-0:/home/robot]
# ./test /mnt/export/testrtyer
ID of containing device:  [0,2a]
File type:                regular file
I-node number:            -5858193331384707222
Mode:                     100644 (octal)
Link count:               1
Ownership:                UID=0   GID=615
Preferred I/O block size: 131072 bytes
File size:                0 bytes
Blocks allocated:         0
Last status change:       Mon Nov 18 09:55:57 2019
Last file access:         Mon Nov 18 09:55:57 2019
Last file modification:   Mon Nov 18 09:55:57 2019
ftruncate is called, size= 52

File block after truncate: 1
File size after truncate: 52
[root@mn-0:/home/robot]




After I study the code I think the difference is when consistent metadata option is on
__afr_inode_write_cbk will clean the prebuf and postbuf, this cause the fuse get all-zero prebuf and postbuf , so fstat get 0 filesize and blocks, and cause app crashed in snprintf .
This issue is permanent in my test env(glusterfs7.0 community code), could you please try this test.c I think this is a bug, in glusterfs3.12 version there is no such issue.

Some config in my env:
[root@mn-0:/home/robot]
# gluster v info export

Volume Name: export
Type: Replicate
Volume ID: 69b29303-36e0-4146-b4d5-adf0dcbe3a48
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: mn-0.local:/mnt/bricks/export/brick
Brick2: mn-1.local:/mnt/bricks/export/brick
Brick3: dbm-0.local:/mnt/bricks/export/brick
Options Reconfigured:
cluster.entry-self-heal: on
cluster.data-self-heal: on
cluster.metadata-self-heal: on
performance.client-io-threads: off
cluster.heal-timeout: 60
cluster.favorite-child-policy: mtime
network.ping-timeout: 42
server.allow-insecure: on
cluster.consistent-metadata: on
cluster.quorum-reads: true
cluster.quorum-type: auto
cluster.server-quorum-type: none
transport.address-family: inet
storage.fips-mode-rchecksum: on
nfs.disable: on
diagnostics.client-log-level: INFO
cluster.server-quorum-ratio: 51%

Comment 2 Ravishankar N 2019-11-20 06:28:58 UTC
Looks like a fuse (kernel) caching issue to me. If I mount the volume with attribute-timeout and entry-timeout set to zero, I don't see the crash and I get the correct stat size 
Can you confirm if this works?
1. mount -t glusterfs -o attribute-timeout=0 -o entry-timeout=0   IP:volname  /mout-point
2. Run your test.c program.

Comment 3 Ravishankar N 2019-11-20 06:40:09 UTC
(In reply to Ravishankar N from comment #2)
> Looks like a fuse (kernel) caching issue to me. 
But since there is no crash on glusterfs-3.12 using the same fuse kernel bits, there might be something more to it in gluster itself that needs further investigation.

Comment 4 zhou lin 2019-11-25 05:32:13 UTC
ok, i will test

Comment 5 Ravishankar N 2019-11-25 05:58:07 UTC
(In reply to zhou lin from comment #4)
> ok, i will test

Thanks. I haven't found what the problem is yet but in case if this helps you debug yourself, I found a fairly simple reproducer: The problem is that if you immediately do a stat after truncate (to non-zero size), it is getting size as zero as you observed. But if you wait for 1 second, it gets the right size.

[root@vm1 fuse_mnt]# TIME=0
[root@vm1 fuse_mnt]# FILE=file1

[root@vm1 fuse_mnt]# truncate -s 52 $FILE && sleep $TIME && stat -c "%s" $FILE
0

[root@vm1 fuse_mnt]# TIME=1
[root@vm1 fuse_mnt]# truncate -s 52 $FILE && sleep $TIME && stat -c "%s" $FILE
52

I also found that if you disable 'readdir-ahead' translator, you get the correct size even if you mount without attribute-timeout=0 and entry-timeout=0.

Comment 6 zhou lin 2019-11-25 06:08:38 UTC
hi, i've test mouting with options, this works, the coredump issue is gone.
however, i am wondering why in glusterfs3.12.15 there is no such issue even when we set cluster.consisntent-metadata to "on" ?
from user pov, we usually do not konw use extra mount options, so this seems like a bug for me.

Comment 7 Kaleb KEITHLEY 2019-11-25 11:33:15 UTC
The commmon-ha component is for the ganesha HA bits.

Comment 8 Ravishankar N 2019-11-28 06:37:31 UTC
(In reply to Ravishankar N from comment #5)

So the reason it is not happening in 3.12.x is because it did not have the function rda_inode_ctx_update_iatts() introduced by the commit "*c9bde3021 - performance/readdir-ahead: keep stats of cached dentries in sync with modifications"

What is happening is this:
-When consistent-metadata is enabled, afr calls afr_zero_fill_stat() in its truncate call back and sets ctime and nlink to zero in the postbuf. This behaviour is already there in glusterfs 3.12  This forces fuse to trigger a lookup which hits AFR (and thus the bricks) and the application gets the current ia_size when it does the stat.
-But in the current code-base, when ctime is zero, the function rda_inode_ctx_update_iatts() zeroes out everything else also in the postbuf except gfid and ia_type. This somehow is not causing FUSE to trigger a lookup.

Raghavendra, is there some field in the postbuf that we need to retain and not zero out (in addition to the gfid and ia_type) in rda_inode_ctx_update_iatts() so that fuse sends a lookup when application does a stat?

Comment 9 Ravishankar N 2019-11-28 06:41:18 UTC
Example postbuf values from truncate cbks as seen in gdb:

Sent by AFR to readdir-ahead:
---------------------------
$6 = {ia_flags = 6143, ia_ino = 10578400756749423401, ia_dev = 64768, ia_rdev = 0, ia_size = 52, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 4096, ia_blocks = 0, ia_atime = 1574661108, ia_mtime = 1574853910, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 650755047, ia_mtime_nsec = 928005039, ia_ctime_nsec = 928005039, ia_btime_nsec = 0, ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = "\004\343F\357\225sB\346\222\316\aq\005\305\323)",
  ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000', owner = {read = 1 '\001', write = 1 '\001', exec = 0 '\000'}, group = {read = 1 '\001', write = 0 '\000', exec = 0 '\000'}, other = {read = 1 '\001',
      write = 0 '\000', exec = 0 '\000'}}}
    
Sent by readdir-ahead to fuse-bridge:
-----------------------------
 $7 = {ia_flags = 0, ia_ino = 0, ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 0, ia_blocks = 0, ia_atime = 0, ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0, ia_mtime_nsec = 0,
  ia_ctime_nsec = 0, ia_btime_nsec = 0, ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = "\004\343F\357\225sB\346\222\316\aq\005\305\323)", ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000', sticky = 0 '\000',
    owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}}     
    
Sent by fuse-bridge to the fuse-kernel:
-----------------------------------------
$9 = {ia_flags = 0, ia_ino = 0, ia_dev = 0, ia_rdev = 0, ia_size = 0, ia_nlink = 0, ia_uid = 0, ia_gid = 0, ia_blksize = 131072, ia_blocks = 0, ia_atime = 0, ia_mtime = 0, ia_ctime = 0, ia_btime = 0, ia_atime_nsec = 0,
  ia_mtime_nsec = 0, ia_ctime_nsec = 0, ia_btime_nsec = 0, ia_attributes = 0, ia_attributes_mask = 0, ia_gfid = "\004\343F\357\225sB\346\222\316\aq\005\305\323)", ia_type = IA_IFREG, ia_prot = {suid = 0 '\000', sgid = 0 '\000',
    sticky = 0 '\000', owner = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, group = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}, other = {read = 0 '\000', write = 0 '\000', exec = 0 '\000'}}}

Comment 10 Raghavendra G 2019-11-29 04:08:01 UTC
(In reply to Ravishankar N from comment #8)
> (In reply to Ravishankar N from comment #5)
> 
> So the reason it is not happening in 3.12.x is because it did not have the
> function rda_inode_ctx_update_iatts() introduced by the commit "*c9bde3021 -
> performance/readdir-ahead: keep stats of cached dentries in sync with
> modifications"
> 
> What is happening is this:
> -When consistent-metadata is enabled, afr calls afr_zero_fill_stat() in its
> truncate call back and sets ctime and nlink to zero in the postbuf. This
> behaviour is already there in glusterfs 3.12  This forces fuse to trigger a
> lookup which hits AFR (and thus the bricks) and the application gets the
> current ia_size when it does the stat.
> -But in the current code-base, when ctime is zero, the function
> rda_inode_ctx_update_iatts() zeroes out everything else also in the postbuf
> except gfid and ia_type. This somehow is not causing FUSE to trigger a
> lookup.
> 
> Raghavendra, is there some field in the postbuf that we need to retain and
> not zero out (in addition to the gfid and ia_type) in
> rda_inode_ctx_update_iatts() so that fuse sends a lookup when application
> does a stat?

comment #5 says things work fine if we wait for 1s which is incidentally 'attribute-timeout' default value. When we wrote that patch in readdir-ahead, we were expecting kernel to only refresh the 'entry' (path to inode resolution) without touching the 'attributes' (rest of the stat). It looks like something is broken there.

Just to get more details, when was the last stat sent (before zeroed out postbuf was sent)? Is the 1s you mentioned was,
1. after the lookup/readdirp response that returned zeroed out stat (except for gfid and type)?
2. after a valid lookup/readdirp/stat/fstat response that returned a completely valid stat (all fields are valid)?

A more correct way to make sure kernel forgets the inode and initiates a lookup (without being subjected to various timeouts like attribute-timeout, entry-timeout, md-cache-timeout) is to call inode_invalidate on the inode.

Comment 11 Raghavendra G 2019-11-29 04:11:58 UTC
> Just to get more details, when was the last stat sent (before zeroed out postbuf was sent)? Is the 1s you mentioned was,

... when was the last stat (with all fields valid) sent...

Comment 12 Ravishankar N 2019-11-29 04:55:09 UTC
(In reply to Raghavendra G from comment #10)
(In reply to Raghavendra G from comment #11)

> Just to get more details, when was the last stat (with all fields valid) sent?
> Is the 1s you mentioned was,
> 1. after the lookup/readdirp response that returned zeroed out stat (except
> for gfid and type)?
> 2. after a valid lookup/readdirp/stat/fstat response that returned a
> completely valid stat (all fields are valid)?

For the test "[root@vm1 fuse_mnt]# truncate -s 52 $FILE && stat -c "%s" $FILE", here are the fops and their counts from profile info:
Note: there is no readdir(p) involved.
With readdir-ahead on:
------------------------
1     RELEASE
1       FLUSH
2    FINODELK
1        OPEN
1   FTRUNCATE
2    FXATTROP
2      LOOKUP

With readdir-ahead disabled:
--------------------------

 1     RELEASE
 2    FINODELK
 1       FLUSH
 1        OPEN
 1   FTRUNCATE
 2    FXATTROP
 1    GETXATTR
 4      LOOKUP    

So with readdir-ahead disabled (i.e. rda_inode_ctx_update_iatts() is not involved), there are 2 extra lookups after the ftruncate that helps getting the correct size. 
With readdir ahead enabled, those 2 lookup do not seem to be coming which means the kernel is still considering the values returned during the truncate cbk as valid.

Comment 13 Worker Ant 2020-03-12 12:16:55 UTC
This bug is moved to https://github.com/gluster/glusterfs/issues/860, and will be tracked there from now on. Visit GitHub issues URL for further details