Description of problem:

In Paul Cuzner's setup, with 3.7.9-6, he hit an issue where some of his VMs paused while running the random-write workload on a 3-node HC setup. Pranith, Sahina and I debugged this and here's what we have found so far:

1) The fops from the app (qemu in this case) are all failing with 'Read-only file system'. This can happen if quorum is lost.

<sample-log>
[2016-05-24 08:09:42.823001] W [fuse-bridge.c:1066:fuse_setattr_cbk] 0-glusterfs-fuse: 401433: SETATTR() /__DIRECT_IO_TEST__ => -1 (Read-only file system)
</sample-log>

2) Volume status reported that all bricks are up.

3) On examining the connections between the client on the first node and the three bricks, here's what we found:

[root@gprfc085 gprfc085-glfs.rhev-rhss.lab:vmstore]# cat .meta/graphs/active/vmstore-client-0/private
[xlator.protocol.client.vmstore-client-0.priv]
fd.0.remote_fd = -1
fd.1.remote_fd = -1
fd.2.remote_fd = -1
fd.3.remote_fd = -1
fd.4.remote_fd = -1
fd.5.remote_fd = -1
fd.6.remote_fd = -1
fd.7.remote_fd = -1
fd.8.remote_fd = -1
connecting = 0
connected = 0
total_bytes_read = 2447897508
ping_timeout = 30
total_bytes_written = 333622728
ping_msgs_sent = 1
msgs_sent = 23

[root@gprfc085 gprfc085-glfs.rhev-rhss.lab:vmstore]# cat .meta/graphs/active/vmstore-client-1/private
[xlator.protocol.client.vmstore-client-1.priv]
fd.0.remote_fd = 0
fd.1.remote_fd = 1
fd.2.remote_fd = 2
fd.3.remote_fd = 3
fd.4.remote_fd = 4
fd.5.remote_fd = 5
fd.6.remote_fd = 12
connecting = 0
connected = 1
total_bytes_read = 2583413660
ping_timeout = 30
total_bytes_written = 2083188104
ping_msgs_sent = 140
msgs_sent = 48249

[root@gprfc085 gprfc085-glfs.rhev-rhss.lab:vmstore]# cat .meta/graphs/active/vmstore-client-2/private
[xlator.protocol.client.vmstore-client-2.priv]
fd.0.remote_fd = -1
fd.1.remote_fd = -1
fd.2.remote_fd = -1
fd.3.remote_fd = -1
fd.4.remote_fd = -1
fd.5.remote_fd = -1
fd.6.remote_fd = -1
connecting = 0
connected = 0
total_bytes_read = 1533713824
ping_timeout = 30
total_bytes_written = 5070587840
ping_msgs_sent = 1
msgs_sent = 87

What the above suggests is that the mount process is 'connected' to brick-1 but disconnected from brick-0 and brick-2. But the connection state in memory says otherwise:

(gdb) ptype $15->rpc->conn->connected
type = char
(gdb) p $15->rpc->conn->connected
$19 = 1 '\001'

This means there is possibly something wrong in the connection state maintained in gluster.

Version-Release number of selected component (if applicable):
3.7.9-6

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
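For reference, a minimal sketch of how the per-brick 'connected' state can be pulled from the mount's .meta directory for every client xlator at once, as was done above. The mount path is a placeholder; this assumes the meta xlator is enabled on the FUSE mount (it is by default).

# Print the connection state of every protocol/client xlator of the mount.
MNT=/path/to/fuse/mount          # placeholder for the actual FUSE mount
for f in "$MNT"/.meta/graphs/active/*-client-*/private; do
    echo "== $f =="
    grep -E '^connect' "$f"    # matches 'connecting = ' and 'connected = '
done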
Created attachment 1162322 [details] screenshot showing vm corruption
Created attachment 1162323 [details] vol profile from random read test run - 9 concurrent VMs
I was testing with a DHT volume with one brick. For the first run, I didn't see any issues. But during the second iteration of the run, with stats enabled, one of the VMs stalled for a few minutes with the error message: kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. With replica 3, I have seen such issues with many VMs. I am now trying to reproduce this issue directly with an XFS backend.
I debugged this issue on Paul's setup; there is some interesting data.

On 85:
grep pwrite * | awk '{print $NF}' | cut -f2 -d'<' | cut -f1 -d '>' | sort -n
...
1.010784
1.602589
2.220474
9.561357
23.518096
35.252092

On 86:
0.328969
2.149209
4.145603
4.193315
11.286010
21.170648
22.617153
30.987933
36.785079

On 87:
0.567854
1.445733
1.554731
1.789662
3.405890
5.528438
8.724252
9.126872
16.015166
16.091657
33.126876

fdatasync took similar times:

85: fdatasync
6.223173
6.291503
6.806233
8.686716
9.062541
11.541201
12.027429
12.837996
12.865618
41.909784

86: fdatasync
4.513816
4.546673
5.085298
6.565026
6.912619
8.550049
8.763380
9.738982
12.135651
37.816719

87: fdatasync
10.378238
10.409009
10.502916
10.548864
10.690511
11.242788
16.045023
16.047131
16.083629
44.731631
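For completeness, the fdatasync numbers above can be extracted with the same pipeline, assuming the strace output files in the current directory record time-in-syscall inside angle brackets as in the pwrite command shown:

# Five slowest fdatasync calls from the strace files in the current directory.
grep fdatasync * | awk '{print $NF}' | cut -f2 -d'<' | cut -f1 -d'>' | sort -n | tail -5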
I was able to hit this issue consistently in my setup (BLR labs) a couple of days ago. I am unable to hit it now. But Paul is hitting this issue consistently on the BAGL setup.
One more piece of information: I tried testing with a plain XFS backend and did not see the VM pause issue across 7 iterations of testing.
Theory for why this may have happened: both replicate and sharding need the write to happen inside locks so that, after the write, they can find out whether it was an appending write, how many new blocks were allocated for it, etc. But the lock used in the posix xlator is an in-memory spin-lock. The index xlator uses the same lock to perform its operations, and since the index xlator sits above io-threads, there is a chance that the epoll thread hangs trying to get the index's inode context while a write is in progress. If the epoll thread hangs for as long as 30-40 seconds, ping-timer expiry can happen. We provided a build to Sas which moves the syscalls under a different mutex lock so that the epoll thread never contends with a syscall in posix. If this fixes the issue, we will add more logs to confirm our theory and get the patch merged.
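A rough diagnostic sketch for this theory, assuming gdb and glusterfs debuginfo are installed on the brick host; the process name and PID selection are for a single-brick node and may need adjusting. If the epoll threads show backtraces blocked on a lock while a write syscall is in flight, that supports the theory above.

# Dump backtraces of all brick threads and look for epoll threads stuck on a lock.
BRICK_PID=$(pidof glusterfsd | awk '{print $1}')
gdb -batch -p "$BRICK_PID" -ex 'thread apply all bt' 2>/dev/null \
    | grep -i -B 2 -A 8 -E 'epoll|spin|lock'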
I have tested on Paul's BAGL setup with the scratch/private build [1] provided by Pranith.

VMs still go to a paused state and one of the hypervisors stops responding while the 100% write workload is running concurrently on all 9 application VMs.

[1] - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118991
Thanks for staying so late to complete this, Satheesaran. From the strace I found one more possibility: rpc-throttling contributing to ping-timer expiry. Paul and I are now working to set it to zero and try the run:

gluster v set <volname> server.outstanding-rpc-limit 0

Will update with what we observe.
For the benefit of others, let me explain what this option is all about: on the brick side there is a limit on how many in-flight rpc messages can be processed by the brick. After 64 in-flight fops have been sent by the client to the brick, the brick stops listening on the transport until at least one of those fops is replied to. What is happening in this setup is that we have fops taking more than 30-40 seconds, so the server transport would have stopped listening to the client for more than 30 seconds. This leads to the ping rpcs not being responded to; that was my second theory for the ping-timeouts. Setting "gluster v set <volname> server.outstanding-rpc-limit 0" disables this limit. The option was mainly introduced for NFS servers, because the NFS client retransmits requests and that was leading to OOM kills; gluster does not retransmit messages.

This fixed the VM pauses caused by ping timeouts, but we still see VM pauses because the heartbeat is being exceeded in RHEV, which leads to a reboot of the host. Need to debug this further with Sahina/sas/Krutika. Maybe Brian Foster will also get involved from the XFS side to figure this one out: write/fdatasync/stat are all taking close to 30-40 seconds on XFS.

Pranith
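A minimal sketch of applying and verifying this workaround, assuming a volume named vmstore and that `gluster volume get` is available in this release:

# Disable the per-client in-flight rpc cap on the bricks (0 = unlimited),
# then confirm the value took effect. "vmstore" is just an example volume name.
gluster volume set vmstore server.outstanding-rpc-limit 0
gluster volume get vmstore server.outstanding-rpc-limit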
Tested the scenario where power management on all the hosts was disabled. I am still hitting these issues (VM pause, one of the hosts going down and coming back).
(In reply to SATHEESARAN from comment #17)
> I have tested with Paul's BAGL setup with the scratch/private build[1]
> provided by Pranith
> 
> VMs still go to paused state and one of the hypervisor doesn't responds
> while 100% write workload is happening concurrently on all the 9 application
> VMs
> 
> [1] - https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11118991

This build was actually provided by Krutika...
I realized that my initial analysis today was wrong, because Du pointed out that throttling is disabled by default. I was almost out of ideas by noon; then Krutika analysed the straces and found that neither fsync nor pwrite takes more than 3 seconds in the workload on replica-3 with o-direct, whereas without o-direct both fdatasync and pwrite were taking almost 30 seconds in the worst case. Our next step is to check whether these issues happen with the o-direct patches Krutika has been working on. We will be providing a test build with these patches.
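For reference, a sketch of how such per-syscall latencies can be captured on a brick; the -T flag records the time spent inside each syscall. The process selection, syscall list and output path here are illustrative, not the exact command used in this analysis.

# Record time-in-syscall for write/sync calls issued by the brick process.
BRICK_PID=$(pidof glusterfsd | awk '{print $1}')
strace -f -tt -T -e trace=pwrite64,fdatasync,fsync \
    -o /var/tmp/brick-fops.strace -p "$BRICK_PID"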
Here's the new build with the o-direct fixes: https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11126162

When you run tests with this build, make sure to have:
i) disabled remote-dio
ii) enabled performance.strict-o-direct
(see the sketch below for these settings and a sample job file)

To check that it does not break existing functionality, I ran the following tests:

1) `iozone -I` on the mountpoint of a sharded volume (no virt settings). The '-I' option opens the file with o-direct for both reads and writes.
- RESULT: PASS

2) On my 3-way replicated sharded volume with 1 VM (100G), I first disabled remote-dio, then enabled strict-o-direct. Inside a single VM I ran fio (strictly write-only) with the job file at http://10.70.35.127/randw.job (courtesy: sas). fio completed in less than 5 minutes (3 min, says my intuition) without a glitch. No ping timeouts, no pauses, no errors due to possible bugs in the o-direct work for sharding. All log files looked squeaky clean at the end of it.
- RESULT: PASS

Pranith has also looked into my patches. We found bugs and fixed them, so now the patches look OK.

Now, over to Sas and Paul... :)
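A sketch of the settings and an illustrative fio job. The volume name is a placeholder and the job file below is NOT the contents of randw.job (which isn't reproduced here); it is just a plausible random-write job with direct I/O enabled, to be run from inside a VM.

# Volume settings for the o-direct test path; "vmstore" is an example name.
gluster volume set vmstore network.remote-dio disable
gluster volume set vmstore performance.strict-o-direct on

# Illustrative random-write fio job; adjust size/paths/runtime as needed.
cat > /tmp/randw.job <<'EOF'
[global]
ioengine=libaio
direct=1
bs=4k
iodepth=16
runtime=300
time_based=1

[randwrite]
rw=randwrite
size=4g
filename=/var/tmp/fio-odirect-test.file
EOF
fio /tmp/randw.job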
UPDATE: Paul ran his tests on the new build (see comment #23) and saw no VM pause this time (yay!). However, he ran into another issue caused by a bug in sharding, which Pranith and he debugged (very) early this morning; the following patch was sent as a fix:
http://review.gluster.org/#/c/14623/

With this patch and http://review.gluster.org/#/c/14215/ I've made another scratch build and shared it with Paul and Sas:
https://brewweb.engineering.redhat.com/brew/taskinfo?taskID=11129563

Waiting for their feedback at the moment.

-Krutika
Krutika,

As discussed, please list down all the patches which are to go in to fix this bug.

~Atin
(In reply to Atin Mukherjee from comment #25)
> Krutika,
> 
> As discussed, please list down all the patches which are to go in to fix
> this bug.
> 
> ~Atin

The following patches need to be backported to downstream:
http://review.gluster.org/14271
http://review.gluster.org/10219
http://review.gluster.org/14215
http://review.gluster.org/14191
http://review.gluster.org/14639
http://review.gluster.org/14623

-Krutika
(In reply to Krutika Dhananjay from comment #26)
> The following patches need to be backported to downstream:
> http://review.gluster.org/14271
> http://review.gluster.org/10219
> http://review.gluster.org/14215
> http://review.gluster.org/14191
> http://review.gluster.org/14639
> http://review.gluster.org/14623
> 
> -Krutika

Note that of these 6 patches, only three fix the actual issue. Of the remaining patches, one adds log messages for debugging, another is a backport without which there would be conflicts, and the third fixes protocol/client to handle remote-dio correctly.
All six patches have been merged into downstream.
Tested with the RHGS 3.1.3 nightly build - glusterfs-3.7.9-10.elrhgs - with the following test:
1. Created a replica 3 volume and optimized the volume for virt-store
2. Disabled remote-dio and enabled strict-o-direct on the volume (see the sketch below for steps 1-2)
3. Created the RHEV data domain with the above created volume
4. Created 30 application VMs
5. Ran the 100% write workload (random write) on all the VMs using the fio tool
6. Repeated the test twice; no issues were found
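A minimal sketch of steps 1-2, with hostnames, brick paths and the volume name as placeholders; "group virt" applies the stock virt-store option set shipped with RHGS.

# Replica-3 volume optimized for virt-store, with the o-direct settings.
gluster volume create vmstore replica 3 \
    host1:/rhgs/brick1/vmstore host2:/rhgs/brick1/vmstore host3:/rhgs/brick1/vmstore
gluster volume set vmstore group virt
gluster volume set vmstore network.remote-dio disable
gluster volume set vmstore performance.strict-o-direct on
gluster volume start vmstore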
I was running a fio test with 24 VMs and was able to reproduce this problem when I ran the workload in cache mode. The workload worked fine when running with directIO with up to 30 VMs. This happens when the system runs very low on free memory and kswapd activity goes up. vmstat did not report any swapping, but the system was definitely very close to swapping.
Sanjay,

We found the root cause to be fsyncs taking almost 30-40 seconds on XFS, which leads to too much delay/freezing of the threads when the workload is run in cache mode. With directIO none of these problems are seen, and the max fsync latency Krutika observed was 2 seconds. That is why we got all the directio patches in as soon as possible.

Pranith
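For reference, a sketch of how the cache mode is chosen per disk at the qemu level; the image path, size and binary invocation are illustrative only, and in a RHEV deployment this is controlled through the disk's cache setting rather than a hand-written command line.

# cache=none makes qemu open the image with O_DIRECT (the configuration that
# avoided the pauses); cache=writeback would go through the host page cache.
qemu-img create -f raw /var/tmp/cache-test.img 10G
qemu-system-x86_64 -m 2048 -enable-kvm \
    -drive file=/var/tmp/cache-test.img,format=raw,if=virtio,cache=none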
Ok. Sounds good.
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. https://access.redhat.com/errata/RHBA-2016:1240