Description of problem:

In Paul Cuzner's setup, with 3.7.9-6, he hit an issue where some of his VMs paused while running the random-write workload on a 3-node HC setup. Pranith, Sahina and I debugged this and here's what we have found so far:

1) The fops from the app (qemu in this case) are all failing with 'Read-only file system'. This can happen if quorum is lost.

<sample-log>
[2016-05-24 08:09:42.823001] W [fuse-bridge.c:1066:fuse_setattr_cbk] 0-glusterfs-fuse: 401433: SETATTR() /__DIRECT_IO_TEST__ => -1 (Read-only file system)
</sample-log>

2) Volume status reported that all bricks are up.

3) On examining the connections between the client on the first node and the three bricks, here's what we found:

[root@gprfc085 gprfc085-glfs.rhev-rhss.lab:vmstore]# cat .meta/graphs/active/vmstore-client-0/private
[xlator.protocol.client.vmstore-client-0.priv]
fd.0.remote_fd = -1
fd.1.remote_fd = -1
fd.2.remote_fd = -1
fd.3.remote_fd = -1
fd.4.remote_fd = -1
fd.5.remote_fd = -1
fd.6.remote_fd = -1
fd.7.remote_fd = -1
fd.8.remote_fd = -1
connecting = 0
connected = 0
total_bytes_read = 2447897508
ping_timeout = 30
total_bytes_written = 333622728
ping_msgs_sent = 1
msgs_sent = 23

[root@gprfc085 gprfc085-glfs.rhev-rhss.lab:vmstore]# cat .meta/graphs/active/vmstore-client-1/private
[xlator.protocol.client.vmstore-client-1.priv]
fd.0.remote_fd = 0
fd.1.remote_fd = 1
fd.2.remote_fd = 2
fd.3.remote_fd = 3
fd.4.remote_fd = 4
fd.5.remote_fd = 5
fd.6.remote_fd = 12
connecting = 0
connected = 1
total_bytes_read = 2583413660
ping_timeout = 30
total_bytes_written = 2083188104
ping_msgs_sent = 140
msgs_sent = 48249

[root@gprfc085 gprfc085-glfs.rhev-rhss.lab:vmstore]# cat .meta/graphs/active/vmstore-client-2/private
[xlator.protocol.client.vmstore-client-2.priv]
fd.0.remote_fd = -1
fd.1.remote_fd = -1
fd.2.remote_fd = -1
fd.3.remote_fd = -1
fd.4.remote_fd = -1
fd.5.remote_fd = -1
fd.6.remote_fd = -1
connecting = 0
connected = 0
total_bytes_read = 1533713824
ping_timeout = 30
total_bytes_written = 5070587840
ping_msgs_sent = 1
msgs_sent = 87

What the above suggests is that the mount process is 'connected' to brick-1 but disconnected from brick-0 and brick-2. But the connection state in memory says otherwise:

(gdb) ptype $15->rpc->conn->connected
type = char
(gdb) p $15->rpc->conn->connected
$19 = 1 '\001'

This means there is possibly something wrong in the connection state maintained in gluster.

Version-Release number of selected component (if applicable):
3.7.9-6

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
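For reference, a minimal sketch of how the per-brick 'connected' state can be pulled from the mount's .meta directory for every client xlator at once, as was done above. The mount path is a placeholder; this assumes the meta xlator is enabled on the FUSE mount (it is by default).

# Print the connection state of every protocol/client xlator of the mount.
MNT=/path/to/fuse/mount          # placeholder for the actual FUSE mount
for f in "$MNT"/.meta/graphs/active/*-client-*/private; do
    echo "== $f =="
    grep -E '^connect' "$f"    # matches 'connecting = ' and 'connected = '
done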
Created attachment 1162322 [details] screenshot showing vm corruption
Created attachment 1162323 [details] vol profile from random read test run - 9 concurrent VMs
I was testing with a DHT volume with one brick. For the first run, I didn't see any issues. But during the second iteration of the run, with stats enabled, one of the VMs stalled for a few minutes with the error message: kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. With replica 3, I have seen such issues with many VMs. I am now trying to reproduce this issue directly with an XFS backend.
I debugged this issue on Paul's setup; there is some interesting data.

On 85:
grep pwrite * | awk '{print $NF}' | cut -f2 -d'<' | cut -f1 -d '>' | sort -n
...
1.010784
1.602589
2.220474
9.561357
23.518096
35.252092

On 86:
0.328969
2.149209
4.145603
4.193315
11.286010
21.170648
22.617153
30.987933
36.785079

On 87:
0.567854
1.445733
1.554731
1.789662
3.405890
5.528438
8.724252
9.126872
16.015166
16.091657
33.126876

fdatasync took similar times:

85: fdatasync
6.223173
6.291503
6.806233
8.686716
9.062541
11.541201
12.027429
12.837996
12.865618
41.909784

86: fdatasync
4.513816
4.546673
5.085298
6.565026
6.912619
8.550049
8.763380
9.738982
12.135651
37.816719

87: fdatasync
10.378238
10.409009
10.502916
10.548864
10.690511
11.242788
16.045023
16.047131
16.083629
44.731631
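For completeness, the fdatasync numbers above can be extracted with the same pipeline, assuming the strace output files in the current directory record time-in-syscall inside angle brackets as in the pwrite command shown:

# Five slowest fdatasync calls from the strace files in the current directory.
grep fdatasync * | awk '{print $NF}' | cut -f2 -d'<' | cut -f1 -d'>' | sort -n | tail -5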
I was able to hit this issue consistently in my setup (BLR labs) a couple of days ago. I am unable to hit it now. But Paul is hitting this issue consistently on the BAGL setup.
One more piece of information: I tried testing with a plain XFS backend and did not see the VM pause issue across 7 iterations of testing.
Theory for why this may have happened: both replicate and sharding need the write to happen inside locks so that, after the write, they can find out whether it was an appending write, how many new blocks were allocated for it, etc. But the lock used in the posix xlator is an in-memory spin-lock. The index xlator uses the same lock to perform its operations, and since the index xlator sits above io-threads, there is a chance that the epoll thread hangs trying to get the index's inode context while a write is in progress. If the epoll thread hangs for as long as 30-40 seconds, ping-timer expiry can happen. We provided a build to Sas which moves the syscalls under a different mutex lock so that the epoll thread never contends with a syscall in posix. If this fixes the issue, we will add more logs to confirm our theory and get the patch merged.
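A rough diagnostic sketch for this theory, assuming gdb and glusterfs debuginfo are installed on the brick host; the process name and PID selection are for a single-brick node and may need adjusting. If the epoll threads show backtraces blocked on a lock while a write syscall is in flight, that supports the theory above.

# Dump backtraces of all brick threads and look for epoll threads stuck on a lock.
BRICK_PID=$(pidof glusterfsd | awk '{print $1}')
gdb -batch -p "$BRICK_PID" -ex 'thread apply all bt' 2>/dev/null \
    | grep -i -B 2 -A 8 -E 'epoll|spin|lock'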
I have tested on Paul's BAGL setup with the scratch/private build [1] provided by Pranith.

VMs still go to a paused state and one of the hypervisors stops responding while the 100% write workload is running concurrently on all 9 application VMs.

[1] - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118991
Thanks for staying so late to complete this, Satheesaran. From the strace I found one more possibility: rpc-throttling contributing to ping-timer expiry. Paul and I are now working to set it to zero and try the run:

gluster v set <volname> server.outstanding-rpc-limit 0

Will update with what we observe.
For the benefit of others, let me explain what this option is all about: on the brick side there is a limit on how many in-flight rpc messages can be processed by the brick. After 64 in-flight fops have been sent by the client to the brick, the brick stops listening on the transport until at least one of those fops is replied to. What is happening in this setup is that we have fops taking more than 30-40 seconds, so the server transport would have stopped listening to the client for more than 30 seconds. This leads to the ping rpcs not being responded to; that was my second theory for the ping-timeouts. Setting "gluster v set <volname> server.outstanding-rpc-limit 0" disables this limit. The option was mainly introduced for NFS servers, because the NFS client retransmits requests and that was leading to OOM kills; gluster does not retransmit messages.

This fixed the VM pauses caused by ping timeouts, but we still see VM pauses because the heartbeat is being exceeded in RHEV, which leads to a reboot of the host. Need to debug this further with Sahina/sas/Krutika. Maybe Brian Foster will also get involved from the XFS side to figure this one out: write/fdatasync/stat are all taking close to 30-40 seconds on XFS.

Pranith
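A minimal sketch of applying and verifying this workaround, assuming a volume named vmstore and that `gluster volume get` is available in this release:

# Disable the per-client in-flight rpc cap on the bricks (0 = unlimited),
# then confirm the value took effect. "vmstore" is just an example volume name.
gluster volume set vmstore server.outstanding-rpc-limit 0
gluster volume get vmstore server.outstanding-rpc-limit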
Tested the scenario where power management on all the hosts was disabled. I am still hitting these issues (VM pause, one of the hosts going down and coming back).
(In reply to SATHEESARAN from comment #17)
> I have tested with Paul's BAGL setup with the scratch/private build[1]
> provided by Pranith
> 
> VMs still go to paused state and one of the hypervisor doesn't responds
> while 100% write workload is happening concurrently on all the 9 application
> VMs
> 
> [1] - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118991

This build was actually provided by Krutika...
I realized that my initial analysis today was wrong, because Du pointed out that throttling is disabled by default. I was almost out of ideas by noon; then Krutika analysed the straces and found that neither fsync nor pwrite takes more than 3 seconds in the workload on replica-3 with o-direct, whereas without o-direct both fdatasync and pwrite were taking almost 30 seconds in the worst case. Our next step is to check whether these issues happen with the o-direct patches Krutika has been working on. We will be providing a test build with these patches.
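For reference, a sketch of how such per-syscall latencies can be captured on a brick; the -T flag records the time spent inside each syscall. The process selection, syscall list and output path here are illustrative, not the exact command used in this analysis.

# Record time-in-syscall for write/sync calls issued by the brick process.
BRICK_PID=$(pidof glusterfsd | awk '{print $1}')
strace -f -tt -T -e trace=pwrite64,fdatasync,fsync \
    -o /var/tmp/brick-fops.strace -p "$BRICK_PID"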
Here's the new build with the o-direct fixes: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11126162

When you run tests with this build, make sure to have:
i) disabled remote-dio
ii) enabled performance.strict-o-direct
(see the sketch below for these settings and a sample job file)

To check that it does not break existing functionality, I ran the following tests:

1) `iozone -I` on the mountpoint of a sharded volume (no virt settings). The '-I' option opens the file with o-direct for both reads and writes.
- RESULT: PASS

2) On my 3-way replicated sharded volume with 1 VM (100G), I first disabled remote-dio, then enabled strict-o-direct. Inside a single VM I ran fio (strictly write-only) with the job file at http://10.70.35.127/randw.job (courtesy: sas). fio completed in less than 5 minutes (3 min, says my intuition) without a glitch. No ping timeouts, no pauses, no errors due to possible bugs in the o-direct work for sharding. All log files looked squeaky clean at the end of it.
- RESULT: PASS

Pranith has also looked into my patches. We found bugs and fixed them, so now the patches look OK.

Now, over to Sas and Paul... :)
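A sketch of the settings and an illustrative fio job. The volume name is a placeholder and the job file below is NOT the contents of randw.job (which isn't reproduced here); it is just a plausible random-write job with direct I/O enabled, to be run from inside a VM.

# Volume settings for the o-direct test path; "vmstore" is an example name.
gluster volume set vmstore network.remote-dio disable
gluster volume set vmstore performance.strict-o-direct on

# Illustrative random-write fio job; adjust size/paths/runtime as needed.
cat > /tmp/randw.job <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=16
runtime=300
time_based=1

[randwrite]
rw=randwrite
size=4g
filename=/var/tmp/fio-odirect-test.file
EOF
fio /tmp/randw.job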
UPDATE: Paul ran his tests on the new build (see comment #23) and saw no VM pause this time (yay!). However, he ran into another issue caused by a bug in sharding, which Pranith and he debugged (very) early this morning; the following patch was sent as a fix:
http://review.gluster.org/#/c/14623/

With this patch and http://review.gluster.org/#/c/14215/ I've made another scratch build and shared it with Paul and Sas:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11129563

Waiting for their feedback at the moment.

-Krutika
Krutika,

As discussed, please list down all the patches which are to go in to fix this bug.

~Atin
(In reply to Atin Mukherjee from comment #25)
> Krutika,
> 
> As discussed, please list down all the patches which are to go in to fix
> this bug.
> 
> ~Atin

The following patches need to be backported to downstream:
http://review.gluster.org/14271
http://review.gluster.org/10219
http://review.gluster.org/14215
http://review.gluster.org/14191
http://review.gluster.org/14639
http://review.gluster.org/14623

-Krutika
(In reply to Krutika Dhananjay from comment #26)
> The following patches need to be backported to downstream:
> http://review.gluster.org/14271
> http://review.gluster.org/10219
> http://review.gluster.org/14215
> http://review.gluster.org/14191
> http://review.gluster.org/14639
> http://review.gluster.org/14623
> 
> -Krutika

Note that of these 6 patches, only three fix the actual issue. Of the remaining patches, one adds log messages for debugging, another is a backport without which there would be conflicts, and the third fixes protocol/client to handle remote-dio correctly.
All six patches have been merged into downstream.
Tested with the RHGS 3.1.3 nightly build - glusterfs-3.7.9-10.elrhgs - with the following test:
1. Created a replica 3 volume and optimized the volume for virt-store
2. Disabled remote-dio and enabled strict-o-direct on the volume (see the sketch below for steps 1-2)
3. Created the RHEV data domain with the above created volume
4. Created 30 application VMs
5. Ran the 100% write workload (random write) on all the VMs using the fio tool
6. Repeated the test twice; no issues were found
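A minimal sketch of steps 1-2, with hostnames, brick paths and the volume name as placeholders; "group virt" applies the stock virt-store option set shipped with RHGS.

# Replica-3 volume optimized for virt-store, with the o-direct settings.
gluster volume create vmstore replica 3 \
    host1:/rhgs/brick1/vmstore host2:/rhgs/brick1/vmstore host3:/rhgs/brick1/vmstore
gluster volume set vmstore group virt
gluster volume set vmstore network.remote-dio disable
gluster volume set vmstore performance.strict-o-direct on
gluster volume start vmstore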
I was running a fio test with 24 VMs and was able to reproduce this problem when I ran the workload in cache mode. The workload worked fine when running with directIO with up to 30 VMs. This happens when the system runs very low on free memory and kswapd activity goes up. vmstat did not report any swapping, but the system was definitely very close to swapping.
Sanjay,

We found the root cause to be fsyncs taking almost 30-40 seconds on XFS, which leads to too much delay/freezing of the threads when the workload is run in cache mode. With directIO none of these problems are seen, and the max fsync latency Krutika observed was 2 seconds. That is why we got all the directio patches in as soon as possible.

Pranith
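For reference, a sketch of how the cache mode is chosen per disk at the qemu level; the image path, size and binary invocation are illustrative only, and in a RHEV deployment this is controlled through the disk's cache setting rather than a hand-written command line.

# cache=none makes qemu open the image with O_DIRECT (the configuration that
# avoided the pauses); cache=writeback would go through the host page cache.
qemu-img create -f raw /var/tmp/cache-test.img 10G
qemu-system-x86_64 -m 2048 -enable-kvm \
    -drive file=/var/tmp/cache-test.img,format=raw,if=virtio,cache=none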
Ok. Sounds good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240