Bug 1753901 - ioprocess - Implement block size detection compatible with Gluster storage
Summary: ioprocess - Implement block size detection compatible with Gluster storage
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Virtualization Manager
Classification: Red Hat
Component: ioprocess
Version: 4.3.6
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: urgent
Target Milestone: ovirt-4.3.6
Target Release: 4.3.6
Assignee: Nir Soffer
QA Contact: bipin
URL:
Whiteboard:
Depends On:
Blocks: 1751722 1753898
 
Reported: 2019-09-20 08:06 UTC by Sahina Bose
Modified: 2019-10-10 15:39 UTC
CC: 7 users

Fixed In Version: ioprocess-1.3.0
Doc Type: Enhancement
Doc Text:
The current release provides an API to probe the block size of the underlying filesystem. The vdsm package needs this API to support 4k storage on gluster.
Clone Of:
Environment:
Last Closed: 2019-10-10 15:39:40 UTC
oVirt Team: Storage
Target Upstream Version:
Embargoed:


Attachments
stress test for verifying with gluster storage (507 bytes, text/plain), 2019-09-21 00:36 UTC, Nir Soffer
stress test for verifying with gluster storage - improved (592 bytes, text/plain), 2019-09-22 17:10 UTC, Nir Soffer
yum repo for testing ioprocess 1.3.0-1 build (149 bytes, text/plain), 2019-09-22 17:11 UTC, Nir Soffer
ioprocess trace showing extremely slow unaligned writes (225.37 KB, text/plain), 2019-09-22 17:35 UTC, Nir Soffer
reproduce script for vdsm <= 4.3.6 (444 bytes, text/plain), 2019-09-23 06:26 UTC, Nir Soffer


Links:
Red Hat Product Errata RHBA-2019:3030, last updated 2019-10-10 15:39:47 UTC
oVirt gerrit 103468: MERGED, "ioprocess: Introduce probe_block_size", last updated 2020-04-08 02:39:49 UTC
oVirt gerrit 103470: MERGED, "test: Add probing stress test", last updated 2020-04-08 02:39:49 UTC
oVirt gerrit 103471: MERGED, "spec: Bump version to 1.3.0", last updated 2020-04-08 02:39:49 UTC

Description Sahina Bose 2019-09-20 08:06:01 UTC
This bug was initially created as a copy of Bug #1751722

I am copying this bug because the changes introduced in 4.3.6 for 4K block size detection cause a FUSE mount crash in Gluster.



Description of problem:

The Gluster FUSE mount crashes in the shard translator while truncating a file from a very large size (exabytes) to a very small size.

See bt:
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/usr/sbin/glusterfs --process-name fuse --volfile-server=tendrl25.lab.eng.blr.r'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f01260a2eca in shard_common_resolve_shards (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60, 
    post_res_handler=0x7f01260ad770 <shard_post_resolve_truncate_handler>) at shard.c:1030
1030	            local->inode_list[i] = inode_ref(res_inode);
Missing separate debuginfos, use: debuginfo-install keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-13.el7.x86_64 libgcc-4.8.5-36.el7_6.2.x86_64 libselinux-2.5-14.1.el7.x86_64 pcre-8.32-17.el7.x86_64 sssd-client-1.16.2-13.el7_6.8.x86_64
(gdb) bt
#0  0x00007f01260a2eca in shard_common_resolve_shards (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60, 
    post_res_handler=0x7f01260ad770 <shard_post_resolve_truncate_handler>) at shard.c:1030
#1  0x00007f01260a3bfd in shard_refresh_internal_dir (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60, type=type@entry=SHARD_INTERNAL_DIR_DOT_SHARD) at shard.c:1317
#2  0x00007f01260ad90d in shard_truncate_begin (frame=frame@entry=0x7f0104000a58, this=this@entry=0x7f0118015b60) at shard.c:2596
#3  0x00007f01260b506d in shard_post_lookup_truncate_handler (frame=0x7f0104000a58, this=0x7f0118015b60) at shard.c:2659
#4  0x00007f01260a1f8b in shard_lookup_base_file_cbk (frame=0x7f0104000a58, cookie=<optimized out>, this=0x7f0118015b60, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=<optimized out>, buf=0x7f0104013bb0, xdata=0x7f010c00faf8, postparent=0x7f0104013c48) at shard.c:1702
#5  0x00007f01265e4922 in afr_discover_unwind (frame=0x7f010400bd28, this=<optimized out>) at afr-common.c:3011
#6  0x00007f01265e4eeb in afr_discover_done (frame=<optimized out>, this=<optimized out>) at afr-common.c:3106
#7  0x00007f01265f27fd in afr_lookup_metadata_heal_check (frame=frame@entry=0x7f010400bd28, this=this@entry=0x7f0118010da0) at afr-common.c:2761
#8  0x00007f01265f3608 in afr_discover_cbk (frame=frame@entry=0x7f010400bd28, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, 
    inode=inode@entry=0x7f011805b9d8, buf=buf@entry=0x7f011f7fcb40, xdata=0x7f010c00faf8, postparent=postparent@entry=0x7f011f7fcbe0) at afr-common.c:3147
#9  0x00007f012687c412 in client4_0_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7f010c010838) at client-rpc-fops_v2.c:2641
#10 0x00007f012ee6c021 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7f01180510c0, pollin=pollin@entry=0x7f010c001a20) at rpc-clnt.c:755
#11 0x00007f012ee6c387 in rpc_clnt_notify (trans=0x7f0118051380, mydata=0x7f01180510f0, event=<optimized out>, data=0x7f010c001a20) at rpc-clnt.c:922
#12 0x00007f012ee689f3 in rpc_transport_notify (this=this@entry=0x7f0118051380, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7f010c001a20) at rpc-transport.c:542
#13 0x00007f0129778875 in socket_event_poll_in (notify_handled=true, this=0x7f0118051380) at socket.c:2522
#14 socket_event_handler (fd=<optimized out>, idx=<optimized out>, gen=<optimized out>, data=0x7f0118051380, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, 
    event_thread_died=0 '\000') at socket.c:2924
#15 0x00007f012f126806 in event_dispatch_epoll_handler (event=0x7f011f7fd130, event_pool=0x55cbb4dc2300) at event-epoll.c:648
#16 event_dispatch_epoll_worker (data=0x7f0118049a60) at event-epoll.c:761
#17 0x00007f012df01dd5 in start_thread (arg=0x7f011f7fe700) at pthread_create.c:307
#18 0x00007f012d7c902d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
(gdb) p local->first_block
$1 = 0
(gdb) p local->last_block 
$2 = -1
(gdb) p local->prebuf.ia_size
$3 = 18446744073709547520
(gdb) p local->num_blocks
$4 = 0
(gdb) p local->block_size 
$5 = 67108864
(gdb) p (local->prebuf.ia_size - 1)/local->block_size
$6 = 274877906943
(gdb) p (int) $6
$7 = -1




It turns out that the quotient from dividing a very large unsigned 64-bit integer by a relatively small unsigned 64-bit integer is assigned to a signed 32-bit variable, and sometimes this quotient is larger than the largest signed 32-bit integer. In this case, local->last_block gets assigned -1, and the resulting local->num_blocks becomes 0. This leads to a GF_CALLOC of local->inode_list[] with size 0 in shard_truncate_begin():

2580	    local->inode_list = GF_CALLOC(local->num_blocks, sizeof(inode_t *),
2581	                                  gf_shard_mt_inode_list);
2582	    if (!local->inode_list)
2583	        goto err;
2584	

When members of local->inode_list[] beyond its bounds are accessed, the illegal memory access crashes the process.

While this explains the cause of the crash, it doesn't explain why the size of the file was as big as 16 EiB when the writes on the file in question wouldn't extend its size beyond 4 KiB. That is still being investigated. I'll update the BZ once I have the root cause.
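To make the arithmetic concrete, here is a small Python sketch (not gluster code) that reproduces the truncation seen in the gdb session above:

    # Values taken from the gdb session above.
    ia_size = 18446744073709547520      # bogus ~16 EiB file size
    block_size = 67108864               # 64 MiB shard block size

    last_block_u64 = (ia_size - 1) // block_size   # 274877906943, needs 38 bits

    # Emulate assigning the unsigned 64-bit quotient to a signed 32-bit field,
    # as the pre-fix shard.c code did.
    last_block_i32 = last_block_u64 & 0xFFFFFFFF
    if last_block_i32 >= 2**31:
        last_block_i32 -= 2**32

    first_block = 0
    num_blocks = last_block_i32 - first_block + 1   # matches the gdb values

    print(last_block_i32, num_blocks)   # -1 0 -> GF_CALLOC(0, ...), then the crash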

Version-Release number of selected component (if applicable):

master

How reproducible:


Steps to Reproduce:
1.
2.
3.

Actual results:


Expected results:


Additional info:

--- Additional comment from bipin on 2019-09-12 06:43:23 UTC ---

Seeing a similar crash with the steps below:

1. Complete the RHHI-V deployment
2. Create 30 VMs using a pool
3. After an hour or so, the FUSE mount crash could be seen


(gdb) bt
    #0  0x00007ff30b541a1a in shard_common_resolve_shards (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470, post_res_handler=0x7ff30b54b2d0 <shard_post_resolve_truncate_handler>) at shard.c:1030
    #1  0x00007ff30b5425e5 in shard_refresh_internal_dir (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470, type=type@entry=SHARD_INTERNAL_DIR_DOT_SHARD) at shard.c:1321
    #2  0x00007ff30b54b46e in shard_truncate_begin (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470) at shard.c:2573
    #3  0x00007ff30b551cc8 in shard_post_lookup_truncate_handler (frame=0x7ff2d400eba8, this=0x7ff304013470) at shard.c:2637
    #4  0x00007ff30b5409f2 in shard_lookup_base_file_cbk (frame=0x7ff2d400eba8, cookie=<optimized out>, this=0x7ff304013470, op_ret=<optimized out>, op_errno=<optimized out>, inode=<optimized out>,
        buf=0x7ff2d402f700, xdata=0x7ff2fc02fd48, postparent=0x7ff2d402f9a0) at shard.c:1705
    #5  0x00007ff30b7a8381 in dht_discover_complete (this=this@entry=0x7ff304011cc0, discover_frame=discover_frame@entry=0x7ff2d400c028) at dht-common.c:754
    #6  0x00007ff30b7a92d4 in dht_discover_cbk (frame=0x7ff2d400c028, cookie=0x7ff30400f330, this=0x7ff304011cc0, op_ret=<optimized out>, op_errno=117, inode=0x7ff2fc00c578, stbuf=0x7ff2d400aba0,
        xattr=0x7ff2fc02fd48, postparent=0x7ff2d400ac10) at dht-common.c:1133
    #7  0x00007ff30ba61315 in afr_discover_done (frame=0x7ff2d4068fe8, this=<optimized out>) at afr-common.c:3027
    #8  0x00007ff30ba6c175 in afr_lookup_metadata_heal_check (frame=frame@entry=0x7ff2d4068fe8, this=this@entry=0x7ff30400f330) at afr-common.c:2769
    #9  0x00007ff30ba6d089 in afr_discover_cbk (frame=frame@entry=0x7ff2d4068fe8, cookie=<optimized out>, this=<optimized out>, op_ret=<optimized out>, op_errno=<optimized out>, inode=inode@entry=0x7ff2fc00c578,
        buf=buf@entry=0x7ff30c6e3900, xdata=0x7ff2fc0166e8, postparent=postparent@entry=0x7ff30c6e3970) at afr-common.c:3077
    #10 0x00007ff30bcacf3d in client3_3_lookup_cbk (req=<optimized out>, iov=<optimized out>, count=<optimized out>, myframe=0x7ff2d40180a8) at client-rpc-fops.c:2872
    #11 0x00007ff313a90ac0 in rpc_clnt_handle_reply (clnt=clnt@entry=0x7ff304049ac0, pollin=pollin@entry=0x7ff2fc00b880) at rpc-clnt.c:778
    #12 0x00007ff313a90e2b in rpc_clnt_notify (trans=<optimized out>, mydata=0x7ff304049af0, event=<optimized out>, data=0x7ff2fc00b880) at rpc-clnt.c:971
    #13 0x00007ff313a8cba3 in rpc_transport_notify (this=this@entry=0x7ff304049e80, event=event@entry=RPC_TRANSPORT_MSG_RECEIVED, data=data@entry=0x7ff2fc00b880) at rpc-transport.c:557
    #14 0x00007ff30eba55e6 in socket_event_poll_in (this=this@entry=0x7ff304049e80, notify_handled=<optimized out>) at socket.c:2322
    #15 0x00007ff30eba7c2a in socket_event_handler (fd=11, idx=2, gen=4, data=0x7ff304049e80, poll_in=<optimized out>, poll_out=<optimized out>, poll_err=0, event_thread_died=0 '\000') at socket.c:2482
    #16 0x00007ff313d498b0 in event_dispatch_epoll_handler (event=0x7ff30c6e3e70, event_pool=0x555e43424750) at event-epoll.c:643
    #17 event_dispatch_epoll_worker (data=0x555e4347fca0) at event-epoll.c:759
    #18 0x00007ff312b26ea5 in start_thread (arg=0x7ff30c6e4700) at pthread_create.c:307
    #19 0x00007ff3123ec8cd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

(gdb) f 0 
#0  0x00007ff30b541a1a in shard_common_resolve_shards (frame=frame@entry=0x7ff2d400eba8, this=this@entry=0x7ff304013470, post_res_handler=0x7ff30b54b2d0 <shard_post_resolve_truncate_handler>) at shard.c:1030
1030	                        local->inode_list[i] = inode_ref (res_inode);
(gdb) p local->num_blocks
 $1 = 0



--- Additional comment from Worker Ant on 2019-09-12 08:46:16 UTC ---

REVIEW: https://review.gluster.org/23407 (features/shard: Convert shard block indices to uint64) posted (#1) for review on master by Krutika Dhananjay

Comment 1 Nir Soffer 2019-09-20 21:19:39 UTC
According to https://bugzilla.redhat.com/show_bug.cgi?id=1751722#c13
the issue is caused by multiple hosts trying to detect the block size
using the same file (__DIRECT_IO_TEST__). This is expected to work on
POSIX filesystems, but it breaks Gluster metadata management.

Block size detection is already implemented in vdsm, so ioprocess does not
strictly need any change. However, implementing block size detection in a way
that is compatible with existing Gluster is best done in ioprocess, so I'm
using this bug for the fix in ioprocess.
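For reference, a minimal Python 3 sketch of this probing approach (the real implementation is C code inside ioprocess; the file name pattern, flags, and size sequence here are inferred from the strace in comment 8, so treat the details as assumptions):

    import mmap
    import os
    import uuid


    def probe_block_size(mountpoint):
        """Probe the logical block size using a private, unlinked temp file."""
        path = os.path.join(mountpoint, ".prob-%s" % uuid.uuid4())
        fd = os.open(path,
                     os.O_WRONLY | os.O_CREAT | os.O_EXCL | os.O_DSYNC | os.O_DIRECT,
                     0o600)
        try:
            # Unlink right away: the fd stays usable and no hidden file is left
            # behind even if the process dies mid-probe.
            os.unlink(path)
            for size in (1, 512, 4096):
                # O_DIRECT needs an aligned buffer; an anonymous mmap is page-aligned.
                buf = mmap.mmap(-1, size)
                try:
                    os.pwrite(fd, buf, 0)
                    return size
                except OSError:
                    # EINVAL: size is smaller than the logical block size.
                    continue
            raise RuntimeError("could not probe block size for %s" % mountpoint)
        finally:
            os.close(fd)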

Comment 2 Nir Soffer 2019-09-21 00:36:41 UTC
Created attachment 1617335 [details]
stress test for verifying with gluster storage

To verify:
- Install python2-ioprocess on all hosts
- Mount the same gluster volume on all hosts
- Run the attached test script on all hosts for a couple of minutes

   $ python probe_block_size_stress.py /mount/path
   512
   512
   ...

Expected results:
- The script outputs the same block size on all hosts
- No errors in the script's output
- No leftover hidden files in the gluster mountpoint
  (e.g. ".probe-xxx-yyy")
- Gluster fuse mount does not crash
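The 507-byte attachment itself is not reproduced here. A rough stand-in for such a stress loop, reusing the probe_block_size() helper sketched under comment 1 (so not the real ioprocess API and not the actual attachment), could look like this:

    import sys
    import time

    from probe import probe_block_size  # hypothetical module holding the sketch above

    if __name__ == "__main__":
        mountpoint = sys.argv[1]
        while True:
            print(probe_block_size(mountpoint))
            sys.stdout.flush()  # flush so output shows up promptly under nohup (see comment 6)
            time.sleep(1)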

Comment 6 Nir Soffer 2019-09-22 17:10:28 UTC
Created attachment 1617906 [details]
stress test for verifying with gluster storage - improved

This version flushes output so it is easy to run with nohup.

Comment 7 Nir Soffer 2019-09-22 17:11:12 UTC
Created attachment 1617907 [details]
yum repo for testing ioprocess 1.3.0-1 build

Comment 8 Nir Soffer 2019-09-22 17:28:15 UTC
I'm testing now on Avihai's setup: 3 hosts connected to gluster storage.

On each host:
1. Copy ioprocess.repo (attachment 1617907 [details]) to /etc/yum.repos.d/
2. yum upgrade ioprocess python2-ioprocess
3. Copy probe_block_size_stress.py (attachment 1617906 [details]) to /root/
4. Run

    nohup python probe_block_size_stress.py \
        /rhev/data-center/mnt/glusterSD/gluster01.example.com:_storage__local__ge8__volume__0/ &

Started around 20:05, probing in a tight loop: about 2 seconds per probe, 1800 calls per hour
(these systems are extremely slow).

All hosts report the correct block size.

In strace we see:

$ strace -f -tt -T -p 26494
[pid 26500] 20:21:13.893810 open("/rhev/data-center/mnt/glusterSD/gluster01.example.com:_storage__local__ge8__volume__0//.prob-f260033a-694c-4bdf-bcef-4b46f82306b9", O_WRONLY|O_CREAT|O_EXCL|O_DSYNC|O_DIRECT, 0600 <unfinished ...>
[pid 26500] 20:21:13.907069 <... open resumed> ) = 0 <0.013147>
[pid 26500] 20:21:13.907195 unlink("/rhev/data-center/mnt/glusterSD/gluster01.example.com:_storage__local__ge8__volume__0//.prob-f260033a-694c-4bdf-bcef-4b46f82306b9") = 0 <0.005863>
[pid 26500] 20:21:13.913227 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.930909>
[pid 26500] 20:21:15.844328 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.011338>

Krutika, does it make sense that writing 1 byte takes 1.93 seconds? 

Will run for a couple of hours.

Comment 9 Nir Soffer 2019-09-22 17:35:15 UTC
Created attachment 1617908 [details]
ioprocess trace showing extremely slow unaligned writes

Krutika, should we file a bug about the slow unaligned writes?

# grep pwrite64 ioprocess.trace | head -20
26500 20:27:40.037730 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.009804>
26500 20:27:40.133382 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.911269>
26500 20:27:42.044969 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.012527>
26500 20:27:42.105923 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.940483>
26500 20:27:44.046644 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.028278>
26500 20:27:44.162981 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.883644>
26500 20:27:46.046896 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.012689>
26500 20:27:46.135306 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.912911>
26500 20:27:48.048518 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.011936>
26500 20:27:48.105805 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.952041>
26500 20:27:50.058156 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.011621>
26500 20:27:50.133522 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.922091>
26500 20:27:52.055897 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.009861>
26500 20:27:52.100258 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.951614>
26500 20:27:54.052096 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.015949>
26500 20:27:54.125715 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.928027>
26500 20:27:56.054076 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.016307>
26500 20:27:56.142113 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.914674>
26500 20:27:58.057043 pwrite64(0, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 512, 0) = 512 <0.011393>
26500 20:27:58.099114 pwrite64(0, "\0", 1, 0) = -1 EINVAL (Invalid argument) <1.955300>

Comment 10 Nir Soffer 2019-09-23 05:45:38 UTC
Ending the test after ~8 hours, see comment 8 for details.

vdsm1:

root     14617  0.3  0.0 225540  6596 ?        Sl   Sep22   2:36 python probe_block_size_stress.py /rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage_
root     14618  0.0  0.0 384948  1044 ?        Sl   Sep22   0:44  \_ /usr/libexec/ioprocess --read-pipe-fd 7 --write-pipe-fd 6 --max-threads 0 --max-queued-requests -1

[root@storage-ge8-vdsm1 ~]# cat nohup.out | sort -u
512

[root@storage-ge8-vdsm1 ~]# wc -l nohup.out 
22400 nohup.out

vdsm2:

root     19989  0.3  0.0 225540  6596 ?        Sl   Sep22   2:34 python probe_block_size_stress.py /rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage_
root     19990  0.0  0.0 384948  2832 ?        Sl   Sep22   0:37  \_ /usr/libexec/ioprocess --read-pipe-fd 7 --write-pipe-fd 6 --max-threads 0 --max-queued-requests -1

[root@storage-ge8-vdsm2 ~]# cat nohup.out | sort -u
512

[root@storage-ge8-vdsm2 ~]# wc -l nohup.out 
22396 nohup.out

vdsm3:

root     26493  0.3  0.0 225540  8636 ?        Sl   Sep22   2:35 python probe_block_size_stress.py /rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage_
root     26494  0.0  0.0 384948  2832 ?        Sl   Sep22   0:38  \_ /usr/libexec/ioprocess --read-pipe-fd 7 --write-pipe-fd 6 --max-threads 0 --max-queued-requests -1

[root@storage-ge8-vdsm3 ~]# cat nohup.out | sort -u
512

[root@storage-ge8-vdsm3 ~]# wc -l nohup.out 
22390 nohup.out

Checking storage for leftover .probe files:

[root@storage-ge8-vdsm3 ~]# ls -lh /rhev/data-center/mnt/glusterSD/gluster01.example.com\:_storage__local__ge8__volume__0/.probe*
ls: cannot access /rhev/data-center/mnt/glusterSD/gluster01.example.com:_storage__local__ge8__volume__0/.probe*: No such file or directory


All hosts issued ~22,400 probes during the test. Since the normal probing
interval is 5 minutes, this test is equivalent to about 1,866 hours (77 days)
of normal operation.

Comment 11 SATHEESARAN 2019-09-23 05:51:33 UTC
(In reply to Nir Soffer from comment #10)
> Ending the test after ~8 hours, see comment 8 for details.
> 
> vdsm1:
> 
> root     14617  0.3  0.0 225540  6596 ?        Sl   Sep22   2:36 python
> probe_block_size_stress.py
> /rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage_
> root     14618  0.0  0.0 384948  1044 ?        Sl   Sep22   0:44  \_
> /usr/libexec/ioprocess --read-pipe-fd 7 --write-pipe-fd 6 --max-threads 0
> --max-queued-requests -1
> 
> [root@storage-ge8-vdsm1 ~]# cat nohup.out | sort -u
> 512
> 
> [root@storage-ge8-vdsm1 ~]# wc -l nohup.out 
> 22400 nohup.out
> 
> vdsm2:
> 
> root     19989  0.3  0.0 225540  6596 ?        Sl   Sep22   2:34 python
> probe_block_size_stress.py
> /rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage_
> root     19990  0.0  0.0 384948  2832 ?        Sl   Sep22   0:37  \_
> /usr/libexec/ioprocess --read-pipe-fd 7 --write-pipe-fd 6 --max-threads 0
> --max-queued-requests -1
> 
> [root@storage-ge8-vdsm2 ~]# cat nohup.out | sort -u
> 512
> 
> [root@storage-ge8-vdsm2 ~]# wc -l nohup.out 
> 22396 nohup.out
> 
> vdsm3:
> 
> root     26493  0.3  0.0 225540  8636 ?        Sl   Sep22   2:35 python
> probe_block_size_stress.py
> /rhev/data-center/mnt/glusterSD/gluster01.scl.lab.tlv.redhat.com:_storage_
> root     26494  0.0  0.0 384948  2832 ?        Sl   Sep22   0:38  \_
> /usr/libexec/ioprocess --read-pipe-fd 7 --write-pipe-fd 6 --max-threads 0
> --max-queued-requests -1
> 
> [root@storage-ge8-vdsm3 ~]# cat nohup.out | sort -u
> 512
> 
> [root@storage-ge8-vdsm3 ~]# wc -l nohup.out 
> 22390 nohup.out
> 
> Checking storage for leftover .probe files:
> 
> [root@storage-ge8-vdsm3 ~]# ls -lh
> /rhev/data-center/mnt/glusterSD/gluster01.example.com\:
> _storage__local__ge8__volume__0/.probe*
> ls: cannot access
> /rhev/data-center/mnt/glusterSD/gluster01.example.com:
> _storage__local__ge8__volume__0/.probe*: No such file or directory
> 
> 
> All hosts issues ~22400 probes during the tests. Since normal probing
> interval is 5 minutes, this test is equivalent to 1866 hours (77 days).

Hi Nir,

So these tests are good enough to prove that there are no metadata issues with the gluster mount, right?

Do you need QE to run these tests again on our setup?

Comment 12 Nir Soffer 2019-09-23 06:00:34 UTC
(In reply to SATHEESARAN from comment #11)
> (In reply to Nir Soffer from comment #10)
> So these tests are good enough to prove that there no metadata issues with
> gluster mount, right ?

I think they are, but I'm not a gluster expert.

I'm pretty sure the new code is not affected by the metadata management issue
(https://bugzilla.redhat.com/show_bug.cgi?id=1751722#c13), but this code is
writing to deleted files, which is not a typical use pattern, so it can expose
other issues in gluster.

> Do you need QE to run these tests again on our setup ?

I think it would be nice to run this on real storage, which can be 100x
faster than the setup I tested on, for an hour or so.

Comment 13 Nir Soffer 2019-09-23 06:26:13 UTC
Created attachment 1618048 [details]
reproduce script for vdsm <= 4.3.6

To reproduce the original issue, run this script on multiple hosts, probing
the same gluster volume mount:

# python reproduce.py /rhev/data-center/mnt/glusterSD/server:_path
512
512
...

Expected results:
- After some time the gluster FUSE mount helper will crash.
- The size of /rhev/data-center/mnt/glusterSD/server:_path/__DIRECT_IO_TEST__
  will be negative, shown as 16 EiB by ls.
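The attached 444-byte script is not reproduced here. For contrast with the new per-probe temp file approach, a hypothetical reconstruction of the pre-4.3.6 pattern described in comment 1 (all hosts probing the same shared __DIRECT_IO_TEST__ file; the flags, sizes, and loop are assumptions, not the actual attachment):

    import mmap
    import os
    import sys
    import time

    # Hypothetical sketch only: every host writes O_DIRECT probes to the same
    # shared file, the concurrent access pattern that breaks gluster shard
    # metadata per comment 1.
    def probe_shared(mountpoint):
        path = os.path.join(mountpoint, "__DIRECT_IO_TEST__")
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC | os.O_DIRECT, 0o644)
        try:
            for size in (1, 512, 4096):
                buf = mmap.mmap(-1, size)  # page-aligned buffer for O_DIRECT
                try:
                    os.pwrite(fd, buf, 0)
                    return size
                except OSError:
                    continue
        finally:
            os.close(fd)

    if __name__ == "__main__":
        while True:
            print(probe_shared(sys.argv[1]))
            sys.stdout.flush()
            time.sleep(1)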

Comment 14 SATHEESARAN 2019-09-23 06:50:14 UTC
(In reply to SATHEESARAN from comment #11)
>> Status: POST → NEW
>> CC: nsoffer
>> Flags: needinfo?(sasundar) needinfo?(kdhananj) → needinfo?(nsoffer)

Accidentally changed the state of the bug from POST to NEW. Restoring it.

Comment 15 SATHEESARAN 2019-09-23 06:51:30 UTC
(In reply to Nir Soffer from comment #9)
> Created attachment 1617908 [details]
> ioprocess trace showing extremely slow unaligned writes
> 
> Krutika, should we file a bug about the slow unaligned writes?
> 
> [...]

Restoring the needinfo on Krutika, which I removed accidentally while addressing a needinfo on me.

Comment 17 SATHEESARAN 2019-09-23 08:05:56 UTC
Nir / Avihai,

I see there has been a good amount of testing with the latest ioprocess.
As the next step, once these fixes are available as part of the builds (vdsm & ioprocess),
we can do full-fledged sanity testing with these changes in place.

Comment 18 Nir Soffer 2019-09-23 08:52:13 UTC
(In reply to SATHEESARAN from comment #17)
We will release a new ioprocess today with these changes.

A new vdsm requiring this ioprocess will be available later this week
(bug 1753898).

Comment 20 Krutika Dhananjay 2019-09-26 06:24:38 UTC
(In reply to Nir Soffer from comment #9)
> Created attachment 1617908 [details]
> ioprocess trace showing extremely slow unaligned writes
> 
> Krutika, should we file a bug about the slow unaligned writes?
> 
> [...]

Sorry, was out sick.
Can you capture and share the gluster volume profile output?
# gluster volume profile $VOLNAME start
# run the test
# gluster volume profile $VOLNAME info > brick-profile.out
# gluster volume profile $VOLNAME stop

and share the "brick-profile.out" file?

-Krutika

Comment 21 bipin 2019-09-28 05:16:14 UTC
Moving the bug to verified based on the results below:

Version:
=======
glusterfs-3.12.2-47.5.el7rhgs.x86_64
ioprocess-1.3.0-1.el7ev.x86_64
vdsm-4.30.33-1.el7ev.x86_64


1. Complete the RHHI-V deployment
2. Create a RHEL 7.6 template
3. Create 30 VMs using a pool
4. After an hour, delete the VMs in the pool
5. Repeat steps 3 and 4 for 3 iterations
6. Create VMs using the template and run kernel untar and FIO

No crash was seen during the test, so moving the bug to verified.

Comment 25 errata-xmlrpc 2019-10-10 15:39:40 UTC
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

https://access.redhat.com/errata/RHBA-2019:3030

