Bug 1751722

Summary:

Gluster fuse mount crashed during truncate

Product:

[Red Hat Storage] Red Hat Gluster Storage

Reporter:

bipin <bshetty>

Component:

rhhi

Assignee:

Krutika Dhananjay <kdhananj>

Status:

CLOSED ERRATA

QA Contact:

bipin <bshetty>

Severity:

high

Docs Contact:

Priority:

unspecified

Version:

rhhiv-1.6

CC:

amukherj, bshetty, bugs, godas, kdhananj, lbailey, nchilaka, nsoffer, rhs-bugs, sasundar, seamurph, vjuranek

Target Milestone:

---

Keywords:

Triaged, ZStream

Target Release:

RHHI-V 1.6.z Async Update

Flags:

bshetty: needinfo+

Hardware:

x86_64

OS:

Linux

Whiteboard:

Fixed In Version:

vdsm-4.30.33, ioprocess-1.3.0

Doc Type:

Enhancement

Doc Text:

The block size of the underlying file system can now be probed, which enables VDSM support of storage with 4096 byte block size.

Story Points:

---

Clone Of:

1751085

Clones:

1752739 (view as bug list)

Environment:

Last Closed:

2019-10-03 12:24:16 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Bug Depends On:

1753898, 1753901

Bug Blocks:

Attachments:

Description	Flags
Script simulating vdsm block size detection	none

Description bipin 2019-09-12 13:09:04 UTC

+++ This bug was initially created as a clone of Bug #1751085 +++

Description of problem:

Gluster fuse mount crashes in shard translator while truncating the file from a really high size (Exabytes) to a really low size.

See bt:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=tendrl25.lab.eng.blr.r'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f01260a2eca in shard_common_resolve_shards (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60, 
    post_res_handler=0x7f01260ad770 <shard_post_resolve_truncate_handler>) at shard.c:1030
1030	            local->inode_list[i] = inode_ref(res_inode);
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libselinux-2.5-14.1.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.2-13.el7_6.8.x86_64
(gdb) bt
#0  0x00007f01260a2eca in shard_common_resolve_shards (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60, 
    post_res_handler=0x7f01260ad770 <shard_post_resolve_truncate_handler>) at shard.c:1030
#1  0x00007f01260a3bfd in shard_refresh_internal_dir (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60, type=type@entry=SHARD_INTERNAL_DIR_DOT_SHARD) at shard.c:1317
#2  0x00007f01260ad90d in shard_truncate_begin (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60) at shard.c:2596
#3  0x00007f01260b506d in shard_post_lookup_truncate_handler (frame=0x7f0104000a58, this=0x7f0118015b60) at shard.c:2659
#4  0x00007f01260a1f8b in shard_lookup_base_file_cbk (frame=0x7f0104000a58, cookie=<optimized out>, this=0x7f0118015b60, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=<optimized out>, buf=0x7f0104013bb0, xdata=0x7f010c00faf8, postparent=0x7f0104013c48) at shard.c:1702
#5  0x00007f01265e4922 in afr_discover_unwind (frame=0x7f010400bd28, this=<optimized out>) at afr-common.c:3011
#6  0x00007f01265e4eeb in afr_discover_done (frame=<optimized out>, this=<optimized out>) at afr-common.c:3106
#7  0x00007f01265f27fd in afr_lookup_metadata_heal_check (frame=frame@entry=0x7f010400bd28, this=this@entry=0x7f0118010da0) at afr-common.c:2761
#8  0x00007f01265f3608 in afr_discover_cbk (frame=frame@entry=0x7f010400bd28, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f011805b9d8, buf=buf@entry=0x7f011f7fcb40, xdata=0x7f010c00faf8, postparent=postparent@entry=0x7f011f7fcbe0) at afr-common.c:3147
#9  0x00007f012687c412 in client4_0_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f010c010838) at client-rpc-fops_v2.c:2641
#10 0x00007f012ee6c021 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f01180510c0, pollin=pollin@entry=0x7f010c001a20) at rpc-clnt.c:755
#11 0x00007f012ee6c387 in rpc_clnt_notify (trans=0x7f0118051380, mydata=0x7f01180510f0, event=<optimized out>, data=0x7f010c001a20) at rpc-clnt.c:922
#12 0x00007f012ee689f3 in rpc_transport_notify (this=this@entry=0x7f0118051380, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f010c001a20) at rpc-transport.c:542
#13 0x00007f0129778875 in socket_event_poll_in (notify_handled=true, this=0x7f0118051380) at socket.c:2522
#14 socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f0118051380, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, 
    event_thread_died=0 '\000') at socket.c:2924
#15 0x00007f012f126806 in event_dispatch_epoll_handler (event=0x7f011f7fd130, event_pool=0x55cbb4dc2300) at event-epoll.c:648
#16 event_dispatch_epoll_worker (data=0x7f0118049a60) at event-epoll.c:761
#17 0x00007f012df01dd5 in start_thread (arg=0x7f011f7fe700) at pthread_create.c:307
#18 0x00007f012d7c902d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p local->first_block
$1 = 0
(gdb) p local->last_block 
$2 = -1
(gdb) p local->prebuf.ia_size
$3 = 18446744073709547520
(gdb) p local->num_blocks
$4 = 0
(gdb) p local->block_size 
$5 = 67108864
(gdb) p (local->prebuf.ia_size - 1)/local->block_size
$6 = 274877906943
(gdb) p (int) $6
$7 = -1




Turns out the quotient resulting from division of a really high unsigned 64 int to a relatively low unsigned 64 int is assigned to a signed int32 variable. And sometimes this quotient is larger than the largest signed int 32. In this case, local->last_block gets assigned a '-1' and the resulting local->num_blocks after that becomes 0. This leads to a GF_CALLOC with size 0 of local->inode_list[] here in shard_truncate_begin():

2580	    local->inode_list = GF_CALLOC(local->num_blocks, sizeof(inode_t *),
2581	                                  gf_shard_mt_inode_list);
2582	    if (!local->inode_list)
2583	        goto err;
2584	

When the members of local->inode_list[] beyond its boundary are accessed, there is illegal memory access and the process crashes.

While this explains the cause of the crash, what it doesn't explain is why the size of the file was as big as 16Eb when the writes on the file in question wouldn't extend its size beyond 4k. That is still being investigated. I'll update the bz once I have the RC.

Version-Release number of selected component (if applicable):

master

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from bipin on 2019-09-12 06:43:23 UTC ---

Seeing a similar crash with below steps:

1. Complete the RHHI-V deployment
2. Create 30 vm's using pool 
3. After an hour or so could see the FUSE mount crash


(gdb) bt
    #0  0x00007ff30b541a1a in shard_common_resolve_shards (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470, post_res_handler=0x7ff30b54b2d0 <shard_post_resolve_truncate_handler>) at shard.c:1030
    #1  0x00007ff30b5425e5 in shard_refresh_internal_dir (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470, type=type@entry=SHARD_INTERNAL_DIR_DOT_SHARD) at shard.c:1321
    #2  0x00007ff30b54b46e in shard_truncate_begin (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470) at shard.c:2573
    #3  0x00007ff30b551cc8 in shard_post_lookup_truncate_handler (frame=0x7ff2d400eba8, this=0x7ff304013470) at shard.c:2637
    #4  0x00007ff30b5409f2 in shard_lookup_base_file_cbk (frame=0x7ff2d400eba8, cookie=<optimized out>, this=0x7ff304013470, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>,
        buf=0x7ff2d402f700, xdata=0x7ff2fc02fd48, postparent=0x7ff2d402f9a0) at shard.c:1705
    #5  0x00007ff30b7a8381 in dht_discover_complete (this=this@entry=0x7ff304011cc0, discover_frame=discover_frame@entry=0x7ff2d400c028) at dht-common.c:754
    #6  0x00007ff30b7a92d4 in dht_discover_cbk (frame=0x7ff2d400c028, cookie=0x7ff30400f330, this=0x7ff304011cc0, op_ret=<optimized out>, op_errno=117, inode=0x7ff2fc00c578, stbuf=0x7ff2d400aba0,
        xattr=0x7ff2fc02fd48, postparent=0x7ff2d400ac10) at dht-common.c:1133
    #7  0x00007ff30ba61315 in afr_discover_done (frame=0x7ff2d4068fe8, this=<optimized out>) at afr-common.c:3027
    #8  0x00007ff30ba6c175 in afr_lookup_metadata_heal_check (frame=frame@entry=0x7ff2d4068fe8, this=this@entry=0x7ff30400f330) at afr-common.c:2769
    #9  0x00007ff30ba6d089 in afr_discover_cbk (frame=frame@entry=0x7ff2d4068fe8, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=inode@entry=0x7ff2fc00c578,
        buf=buf@entry=0x7ff30c6e3900, xdata=0x7ff2fc0166e8, postparent=postparent@entry=0x7ff30c6e3970) at afr-common.c:3077
    #10 0x00007ff30bcacf3d in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7ff2d40180a8) at client-rpc-fops.c:2872
    #11 0x00007ff313a90ac0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7ff304049ac0, pollin=pollin@entry=0x7ff2fc00b880) at rpc-clnt.c:778
    #12 0x00007ff313a90e2b in rpc_clnt_notify (trans=<optimized out>, mydata=0x7ff304049af0, event=<optimized out>, data=0x7ff2fc00b880) at rpc-clnt.c:971
    #13 0x00007ff313a8cba3 in rpc_transport_notify (this=this@entry=0x7ff304049e80, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7ff2fc00b880) at rpc-transport.c:557
    #14 0x00007ff30eba55e6 in socket_event_poll_in (this=this@entry=0x7ff304049e80, notify_handled=<optimized out>) at socket.c:2322
    #15 0x00007ff30eba7c2a in socket_event_handler (fd=11, idx=2, gen=4, data=0x7ff304049e80, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2482
    #16 0x00007ff313d498b0 in event_dispatch_epoll_handler (event=0x7ff30c6e3e70, event_pool=0x555e43424750) at event-epoll.c:643
    #17 event_dispatch_epoll_worker (data=0x555e4347fca0) at event-epoll.c:759
    #18 0x00007ff312b26ea5 in start_thread (arg=0x7ff30c6e4700) at pthread_create.c:307
    #19 0x00007ff3123ec8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) f 0 
#0  0x00007ff30b541a1a in shard_common_resolve_shards (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470, post_res_handler=0x7ff30b54b2d0 <shard_post_resolve_truncate_handler>) at shard.c:1030
1030	                        local->inode_list[i] = inode_ref (res_inode);
(gdb) p local->num_blocks
 $1 = 0



--- Additional comment from Worker Ant on 2019-09-12 08:46:16 UTC ---

REVIEW: https://review.gluster.org/23407 (features/shard: Convert shard block indices to uint64) posted (#1) for review on master by Krutika Dhananjay

Comment 3 Sahina Bose 2019-09-17 06:08:36 UTC

Nir, this crash is seen with the 4.3.6. Has anything changed in the way that the __DIRECT_IO_TEST__ file is updated?

Comment 4 bipin 2019-09-17 06:49:59 UTC

Clearing the needinfo

Comment 5 Krutika Dhananjay 2019-09-18 15:55:59 UTC

RCA:
The size going negative is when two consecutive truncates on the __DIRECT_IO_TEST__ file (coming from open with O_TRUNC) happened in the following sequence:

1. Size of the file at the beginning - 512b
2. First truncate on a given mount truncated the file to size 0. Delta size = final size - initial size = 0 - 512 = -512.
3. Xattrop is now sent with -512. And file size had been 512. So 512 + (-512) = 0. Final on-disk size at the end of this truncate is 0.
But shard translator in the truncate fop callback continues to cache 512 as the file size.
4. Then a second truncate (again to size 0) is sent, without a lookup or stat preceding it. So the size in cache is believed to be true.
Delta size = final size - initial size = 0 - 512 = -512. (here initial size should have been 0 but it is wrongly assumed to be 512).
So an xattrop is sent with -512. So 0 - 512 = 0xfffffffffffffe00

And this is what we see in the getfattr output of the file:

[root@rhsqa-grafton8 ~]# getfattr -d -m . -e hex /gluster_bricks/data/data/__DIRECT_IO_TEST__
getfattr: Removing leading '/' from absolute path names
# file: gluster_bricks/data/data/__DIRECT_IO_TEST__
security.selinux=0x73797374656d5f753a6f626a6563745f723a676c7573746572645f627269636b5f743a733000
trusted.afr.dirty=0x000000000000000000000000
trusted.gfid=0xe8235ee2ea374f0a9e54853db2781f93
trusted.gfid2path.69693d0e1876710b=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f5f5f4449524543545f494f5f544553545f5f
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0xfffffffffffffe00000000000000000000000000000000010000000000000000

More evidence in wireshark output. Will paste it here tomorrow for the sake of completeness (my laptop's freezing at some point when i try to load the pcap file which is huge in wireshark)

NOTE:
+++++

Part of the reason why we hit this crash now is because truncate codepath in shard had largely remained untested. With the newly introduced block-size detection code in rhv/ovirt, which primarily opens the file __DIRECT_IO_TEST__ with O_TRUNC and performs io on it, the truncate fop path is now getting exercised frequently.

-Krutika

Comment 6 Nir Soffer 2019-09-18 17:03:27 UTC

I think what we do is easy to reproduce like this:

   for bs in 1 512 4096; do
       dd if=/dev/zero of=/path/to/testfile bs=$bs count=1 oflag=direct conv=fsync
   done

We stop when dd succeeds, but I don't think it matter for testing gluster.

Comment 7 Nir Soffer 2019-09-18 17:06:54 UTC

(In reply to Sahina Bose from comment #3)
> Nir, this crash is seen with the 4.3.6. Has anything changed in the way that
> the __DIRECT_IO_TEST__ file is updated?

Yes, this file was always empty before 4.3.6. We used it like this:

    dd if=/dev/zero of=/path bs=4096 count=0 oflag=direct conv=fsync

Verifying that we can do open() with O_DIRECT on the filesystem.

Now we actually so some I/O to detect the capabilities of the filesystem
as explained in comment 6.

Comment 8 Nir Soffer 2019-09-18 17:14:46 UTC

Krutika, it seems that the new code for detecting block size in 4.3.6
is triggering a crash in current gluster version, so I think we should
disable it in  oVirt 4.3.6 until a gluster fix is released.

What do you think?

Comment 9 Krutika Dhananjay 2019-09-19 02:00:50 UTC

(In reply to Nir Soffer from comment #8)
> Krutika, it seems that the new code for detecting block size in 4.3.6
> is triggering a crash in current gluster version, so I think we should
> disable it in  oVirt 4.3.6 until a gluster fix is released.
> 
> What do you think?

Yes, this sounds like the right thing to do.

Comment 10 Nir Soffer 2019-09-19 16:58:00 UTC

(In reply to Krutika Dhananjay from comment #5)
> RCA:
> The size going negative is when two consecutive truncates on the
> __DIRECT_IO_TEST__ file (coming from open with O_TRUNC) happened in the
> following sequence:
> 
> 1. Size of the file at the beginning - 512b

Did you reproduce this with storage with sector size of 512 or 4096?

On 4k storage I don't think we can have such file, unless there is 
another bug in gluster.

What vdsm does is equivalent to:

out=$1

for bs in 1 512 4096; do
    if dd if=/dev/zero bs=$bs count=1 of=$out oflag=direct conv=fsync; then
        echo $bs
        exit 0
    fi
done

echo "cannot use direct I/O" > &2
exit 1

On storage that does not support direct I/O the result is:

$ sh detect-block-size.sh /tmp/foo
dd: failed to open '/tmp/foo': Invalid argument
dd: failed to open '/tmp/foo': Invalid argument
dd: failed to open '/tmp/foo': Invalid argument
cannot use direct I/O

$ ls -lht /tmp/foo 
-rw-rw-r--. 1 nsoffer nsoffer 0 Sep 19 19:46 /tmp/foo

On storage with sector size of 512:

$ sh detect-block-size.sh /var/tmp/foo
dd: error writing '/var/tmp/foo': Invalid argument
1+0 records in
0+0 records out
0 bytes copied, 0.000392325 s, 0.0 kB/s
1+0 records in
1+0 records out
512 bytes copied, 0.0601933 s, 8.5 kB/s
512

$ ls -lht /var/tmp/foo
-rw-rw-r--. 1 nsoffer nsoffer 512 Sep 19 19:48 /var/tmp/foo

On storage with sector size of 4096:

$ sh detect-block-size.sh /var/tmp/vdsm-storage/mount.mount-4k/foo
dd: error writing '/var/tmp/vdsm-storage/mount.mount-4k/foo': Invalid argument
1+0 records in
0+0 records out
0 bytes copied, 0.000116669 s, 0.0 kB/s
dd: error writing '/var/tmp/vdsm-storage/mount.mount-4k/foo': Invalid argument
1+0 records in
0+0 records out
0 bytes copied, 0.000115637 s, 0.0 kB/s
1+0 records in
1+0 records out
4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00483159 s, 848 kB/s
4096

Comment 11 Nir Soffer 2019-09-19 17:24:31 UTC

Created attachment 1616830 [details]
Script simulating vdsm block size detection

Comment 12 Krutika Dhananjay 2019-09-20 08:05:06 UTC

(In reply to Nir Soffer from comment #10)
> (In reply to Krutika Dhananjay from comment #5)
> > RCA:
> > The size going negative is when two consecutive truncates on the
> > __DIRECT_IO_TEST__ file (coming from open with O_TRUNC) happened in the
> > following sequence:
> > 
> > 1. Size of the file at the beginning - 512b
> 
> Did you reproduce this with storage with sector size of 512 or 4096?

Just checked Bipin's setup and it's 512.

[root@rhsqa-grafton8 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_vmstore
512
[root@rhsqa-grafton8 ~]# blockdev --getss /dev/mapper/gluster_vg_sdb-gluster_lv_data 
512
[root@rhsqa-grafton8 ~]# blockdev --getss /dev/mapper/gluster_vg_sdd-gluster_lv_engine 
512

> 
> On 4k storage I don't think we can have such file, unless there is 
> another bug in gluster.
> 
> What vdsm does is equivalent to:
> 
> out=$1
> 
> for bs in 1 512 4096; do
>     if dd if=/dev/zero bs=$bs count=1 of=$out oflag=direct conv=fsync; then
>         echo $bs
>         exit 0
>     fi
> done
> 
> echo "cannot use direct I/O" > &2
> exit 1
> 
> On storage that does not support direct I/O the result is:
> 
> $ sh detect-block-size.sh /tmp/foo
> dd: failed to open '/tmp/foo': Invalid argument
> dd: failed to open '/tmp/foo': Invalid argument
> dd: failed to open '/tmp/foo': Invalid argument
> cannot use direct I/O
> 
> $ ls -lht /tmp/foo 
> -rw-rw-r--. 1 nsoffer nsoffer 0 Sep 19 19:46 /tmp/foo
> 
> On storage with sector size of 512:
> 
> $ sh detect-block-size.sh /var/tmp/foo
> dd: error writing '/var/tmp/foo': Invalid argument
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.000392325 s, 0.0 kB/s
> 1+0 records in
> 1+0 records out
> 512 bytes copied, 0.0601933 s, 8.5 kB/s
> 512
> 
> $ ls -lht /var/tmp/foo
> -rw-rw-r--. 1 nsoffer nsoffer 512 Sep 19 19:48 /var/tmp/foo
> 
> On storage with sector size of 4096:
> 
> $ sh detect-block-size.sh /var/tmp/vdsm-storage/mount.mount-4k/foo
> dd: error writing '/var/tmp/vdsm-storage/mount.mount-4k/foo': Invalid
> argument
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.000116669 s, 0.0 kB/s
> dd: error writing '/var/tmp/vdsm-storage/mount.mount-4k/foo': Invalid
> argument
> 1+0 records in
> 0+0 records out
> 0 bytes copied, 0.000115637 s, 0.0 kB/s
> 1+0 records in
> 1+0 records out
> 4096 bytes (4.1 kB, 4.0 KiB) copied, 0.00483159 s, 848 kB/s
> 4096

Comment 13 Krutika Dhananjay 2019-09-20 08:56:49 UTC

I'm able to recreate the bug now with some simple steps.



[root@dhcp35-215 ~]# gluster v info
 
Volume Name: rep
Type: Replicate
Volume ID: 8cad61f0-4770-4e75-b97c-7bab6cb0fa67
Status: Started
Snapshot Count: 0
Number of Bricks: 1 x 3 = 3
Transport-type: tcp
Bricks:
Brick1: kdhananjay:/bricks/1
Brick2: kdhananjay:/bricks/2
Brick3: kdhananjay:/bricks/3
Options Reconfigured:
performance.strict-o-direct: on
server.event-threads: 4
client.event-threads: 4
cluster.choose-local: off
user.cifs: off
features.shard: on
cluster.shd-wait-qlength: 10000
cluster.shd-max-threads: 8
cluster.locking-scheme: granular
cluster.data-self-heal-algorithm: full
cluster.server-quorum-type: server
cluster.quorum-type: auto
cluster.eager-lock: enable
network.remote-dio: off
performance.low-prio-threads: 32
performance.io-cache: off
performance.read-ahead: off
performance.quick-read: off
transport.address-family: inet
nfs.disable: on
performance.client-io-threads: on


SET MD_CACHE_TIMEOUT TO A SLIGHTLY HIGH VALUE TO SIMULATE THE RACE:
# gluster volume set rep md-cache-timeout 60

CREATE TWO MOUNTS:
# mount -t glusterfs kdhananjay:/rep /mnt
# mount -t glusterfs kdhananjay:/rep /mnt1

FILL A FILE WITH DATA UPTO 512b
# dd if=/dev/urandom of=/mnt/__DIRECT_IO_TEST__ bs=512 count=1 oflag=direct

STAT THE FILE FROM FIRST MOUNT TO ENSURE 512b IS CACHED AS FILE SIZE IN-MEMORY, NOW TRUNCATE THE FILE TO SIZE 0 FROM SECOND MOUNT; THEN TRUNCATE THE FILE TO 0 FROM THE FIRST MOUNT
# stat /mnt/__DIRECT_IO_TEST__; truncate -s 0 /mnt1/__DIRECT_IO_TEST__; truncate -s 0 /mnt/__DIRECT_IO_TEST__


THE RESULT?
[root@dhcp35-215 ~]# getfattr -d -m . -e hex /bricks/1/__DIRECT_IO_TEST__                                                                                                                     
getfattr: Removing leading '/' from absolute path names                                                                                                                                       
# file: bricks/1/__DIRECT_IO_TEST__                                                                                                                                                           
security.selinux=0x756e636f6e66696e65645f753a6f626a6563745f723a6574635f72756e74696d655f743a733000                                                                                             
trusted.afr.dirty=0x000000000000000000000000                                                                                                                                                  
trusted.gfid=0x3d79268080c241d9bf70039c7ce52c54                                                                                                                                               
trusted.gfid2path.69693d0e1876710b=0x30303030303030302d303030302d303030302d303030302d3030303030303030303030312f5f5f4449524543545f494f5f544553545f5f
trusted.glusterfs.shard.block-size=0x0000000004000000
trusted.glusterfs.shard.file-size=0xfffffffffffffe00000000000000000000000000000000000000000000000000

^^ Size is now negative.

Here's what could have happened from VDSM's perspective in the actual rhhi setup:

1. HOST 1 would have successfully performed 512b write (perhaps followed by a lookup or a stat) and cached 512b as the file size.
2. Now HOST 2 would do the same open + write, wherein the open() part has completed so far with O_TRUNC. SO file size in backend has become 0 due to truncate.
3. Now HOST 1 would try the same open + write for the next block size, but size has changed to 0 already unbeknownst to this mount. It does the truncate (because O_TRUNC) and uses cached value to compute delta. So zero minus 512b ends up being negative 512 in the backend.

Basically there is an interleaving of sub-operations of detect_block_size across multiple mount points. By sub-operations i mean open + truncate + write

Comment 14 Nir Soffer 2019-09-20 17:04:05 UTC

Krutika, this makes sense, and explaining the randomness of the failure.

Can you confirm that using a unique file for every host will avoids this
issue?

I'm thinking about this flow:

1. Generate random filename e.g ".prob-cacb3ecf-9445-4638-aa00-af0ae99dc09e"
2. Open the prob file with O_WRONLY |O_DIRECT | O_CREAT | O_SYNC | O_EXCL | O_CLOEXEC
3. Delete the file
4. Use the open fd to detect block size

So the probe file can be accessed only from single host, and we never
truncate it.

Do you see issue with gluster in particular when using shard translator?

Does gluster support O_TMPFILE? this will simplify steps 1-3 to:

    fd = open("/mount/path", O_TMPFILE | O_WRONLY | O_EXCL | O_DIRECT);

Comment 15 Krutika Dhananjay 2019-09-22 12:40:04 UTC

(In reply to Nir Soffer from comment #14)
> Krutika, this makes sense, and explaining the randomness of the failure.
> 
> Can you confirm that using a unique file for every host will avoids this
> issue?
> 

Yes, using a unique file for every host would ensure that the cached values in memory are correct
and are not modified by a different client.


> I'm thinking about this flow:
> 
> 1. Generate random filename e.g ".prob-cacb3ecf-9445-4638-aa00-af0ae99dc09e"
> 2. Open the prob file with O_WRONLY |O_DIRECT | O_CREAT | O_SYNC | O_EXCL |
> O_CLOEXEC
> 3. Delete the file
> 4. Use the open fd to detect block size
> 
> So the probe file can be accessed only from single host, and we never
> truncate it.
> 
> Do you see issue with gluster in particular when using shard translator?
> 
> Does gluster support O_TMPFILE? this will simplify steps 1-3 to:
> 
>     fd = open("/mount/path", O_TMPFILE | O_WRONLY | O_EXCL | O_DIRECT);

I think it should work, though I haven't tested it.

-Krutika

Comment 16 Nir Soffer 2019-09-22 12:47:29 UTC

(In reply to Krutika Dhananjay from comment #15)
> (In reply to Nir Soffer from comment #14)
> > Does gluster support O_TMPFILE? this will simplify steps 1-3 to:
> > 
> >     fd = open("/mount/path", O_TMPFILE | O_WRONLY | O_EXCL | O_DIRECT);
> 
> I think it should work, though I haven't tested it.

open() fails with ENOTSUP and I could not find the error in /var/log/glusterfs,
so I guess this is not supported in fuse kernel side.
https://sourceforge.net/p/fuse/mailman/message/33646980/

Comment 17 SATHEESARAN 2019-09-23 11:07:35 UTC

This issue was reproduced 2 times out of many times ( ~40 times )
Nir (oVirt dev) and Krutika helped to RCA this issue.

This issue is seen because of recent changes in oVirt ( detection of storage block size )
lead to discovering a day1 bug with sharding respect to truncate ( Apologies, not well explained !)

If the fixes are not in place, then it would lead to crash most of the times at the field.
As it failed twice in QE env, the issue has got more probability to be hit in customer environment.

Marking this bug as BLOCKER for RHV 4.3.6

Corresponding RHV bugs that fixes the issues are:
https://bugzilla.redhat.com/show_bug.cgi?id=1753901
https://bugzilla.redhat.com/show_bug.cgi?id=1753898

Comment 18 SATHEESARAN 2019-09-25 07:49:57 UTC

One of the dependent bug is ON_QA and other is on MODIFIED state. Moving this bug to MODIFIED too

Comment 19 SATHEESARAN 2019-09-25 14:36:08 UTC

The dependent bugs are now ON_QA. moving this bug also to ON_QA

Comment 20 bipin 2019-09-28 05:15:20 UTC

Moving the bug to verified based on the below results:

Version:
=======
glusterfs-3.12.2-47.5.el7rhgs.x86_64
ioprocess-1.3.0-1.el7ev.x86_64
vdsm-4.30.33-1.el7ev.x86_64


1. Complete the RHHI-V deployment
2. Create a RHEL 7.6 template
3. Create 30 vm's using pool 
4. After an hour delete the vm's in the pool
5. Repeat step 3 and step 4 for 3 iteration's 
6. Create vm's using template and ran kernel untar and FIO 

Couldn't see any crash during the test. So moving the bug to verified

Comment 21 SATHEESARAN 2019-10-01 01:03:56 UTC

@Laura,

This crash is seen only with RHV 4.3.6 internal builds.
This crash was not reported with the previously released version,
so release doc_text is not required for this bug

Comment 24 errata-xmlrpc 2019-10-03 12:24:16 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:2963