Description of problem: We are investigating the performance of RHS for use with Hadoop clusters. We find that RHS performance is often poor and erratic under heavy I/O loads. Our main observation is that individual client nodes can stall for long periods of time, often as long as a minute, occasionally longer. During these stalls, write traffic from a client node stops while other client nodes continue to write. The last straggler client may continue to stall even though nothing else is happening in the cluster that could cause it to stop. Eventually, the write traffic resumes and the test completes -- most of the time. This trouble report documents a case where the stall did not clear itself, and where one copy of a file appears corrupted. Avati investigated the system while it was in this state. His observations are captured in the email chain below. We are continuing to study the performance-only problem, and will open a separate bugzilla entry for that.

Test cluster and load:
- 16-node test cluster in the Phoenix lab, "Cluster G" (host[02-32]-rack07)
- RHS iso installed: RHSS-2.1.bd-20140502.n.0-RHS-x86_64-DVD1.iso
- Each server has one brick, HW RAID6 (9+2)
- 10Gb/s Ethernet, all on one pair of switches
- In this test, each node runs 16 client processes. Each process writes a 1GB file.
- The test uses POSIX calls to open, write, fsync and close files. Hadoop, the HDFS interface and Java are not involved.
- These stalls appear on writes, not reads.
- These stalls appear when writing to Gluster, but not when writing to the individual bricks.

On 08/20/2014 02:28 PM, Hank Jakiela wrote: > > Avati, > > I've been digging into these long stalls we see under heavy loads. Now the cluster is stuck in an odd state, with what looks like some data corruption. I wonder if you could take a look. > > The current state is this: > > 1. One node is unable to access one of the files that it had previously written. > > 2. All other nodes in the cluster can see the file correctly. > > 3. 
The sick node can see other files correctly. However, ls with a wildcard hangs on that directory. > > 4. The two copies of the file are on the first two bricks, but one copy is the wrong size. > > If you have time to poke around on the node, it's 10.1.4.8. I'll text you the root password. > > Thanks > Hank > From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela> Cc : Diane Feddema <dfeddema> Wed, Aug 20, 2014 07:30 PM From where was the glusterfs RPM installed? I'm trying to install the debuginfo RPM of glusterfs and am unable to find it: # debuginfo-install glusterfs-3.4.0.59rhs-1.el6rhs.x86_64 ... No debuginfo packages available to install Here's what I could diagnose from state dumps thus far: It looks like there are a lot of internal locks which are blocked from the client host08-rack07 on the file wf-host08-rack07-29-000000 stored in host02-rack07. The locks are blocked because there is an unreleased lock on the same file from a client whose transport pointer address is 0xf2bcd0. I need the debuginfo RPMs so that I can easily find out the client-id string stored in the transport object, using gdb. Finding out which client owns the transport object (whose pointer is 0xf2bcd0) will let me progress to the next diagnosis step (on that node). I have a hunch it might be the self-heal daemon client; I'll try to confirm that in other ways in the meantime. From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt> Wed, Aug 20, 2014 08:23 PM Never mind, I could inspect the client id by taking hex dumps from the calculated offset of the field in the structure (off the pointer). It appears to be a connection from host08 process pid 27097, which at this point does not exist. I'm assuming it is some kind of ghost connection which has not yet been disconnected for whatever reason. Still diagnosing why we ended up in this situation. 
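For reference, the load on each client process amounts to something like the following -- a minimal sketch of the POSIX open/write/fsync/close pattern the test uses, not the actual test harness (the path, block size, and function name here are illustrative):

```python
import os

def write_stream(path, total_bytes=1 << 30, block=1 << 20):
    """Write total_bytes to path in block-sized chunks, then fsync and close.

    Mirrors the call pattern described above: open, write, fsync, close.
    """
    buf = b"\0" * block
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < total_bytes:
            written += os.write(fd, buf)
        os.fsync(fd)  # a stalled client typically sits blocked in write or fsync
    finally:
        os.close(fd)
    return written
```

In the test, 16 such processes per node each write a 1GB file onto the FUSE mount; the stalls show up as one node's writers going silent while all the others keep streaming.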
From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt> Wed, Aug 20, 2014 08:25 PM And the reason only host08 is seeing this issue is that there are outstanding (blocked) write requests on this file held up in the write-behind queue. These held-up writes force stat() on the file to queue behind them, hence ls hangs when stat'ing the file. On the other servers the write-behind queue on the file is empty, and stat requests flow through smoothly. Just FYI. On 08/20/2014 06:35 PM, Hank Jakiela wrote: > > Yes, this is the only node that would have been doing anything with that particular file (nb, hostname in the filename). > > Any idea why the writes were blocked? > > We're seeing a lot of cases where many operations on a particular node get held up for long periods, up to 200 seconds or more, after which things start up again. This is the first example where things didn't start up again. > > Hank > From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela> Cc : Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp>, Krishnan Parthasarathi <kparthas> Wed, Aug 20, 2014 09:44 PM So, here's what appears to have happened. During heavy writes, for some reason there was a connection freeze, and the glusterfs client (which was doing the write) seems to have detected the freeze, raised a ping timeout, broken the connection and reconnected -- all of this is visible in the logs. It appears that the "old connection" breakage hasn't registered at the server-side socket, and because of a non-empty socket buffer on the server, KEEPALIVE hasn't kicked in (which would otherwise raise a disconnection). This is evidenced by the client-id of the blocked lock and the acquired lock being equal. 
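The queueing behavior Avati describes can be pictured with a toy model (purely illustrative, not GlusterFS code): per file, metadata requests are ordered behind any writes already sitting in the write-behind queue, so a stuck write also blocks stat() -- and therefore ls -- on that file only.

```python
from collections import deque

class WriteBehindQueue:
    """Toy model: per-file FIFO of pending writes; stat() waits behind them."""

    def __init__(self):
        self.pending = deque()          # queued, unacknowledged write requests

    def queue_write(self, data):
        self.pending.append(data)

    def flush(self):                    # server acknowledged the writes
        self.pending.clear()

    def stat_would_block(self):
        # stat() is ordered behind queued writes to keep metadata consistent
        return len(self.pending) > 0

q = WriteBehindQueue()
q.queue_write(b"data")
print(q.stat_would_block())   # True: ls/stat hangs while writes are stuck
q.flush()
print(q.stat_would_block())   # False: stat flows through on healthy clients
```

This matches the symptom in the report: only the node holding blocked writes sees the hang, while every other client stats the same file without delay.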
I'm copying KP and Raghavendra, who are familiar with this problem/known behavior (to raise the priority of addressing this problem). For now you could either just restart the volume (easiest) or attempt a manual lock clear. If this is just a test cluster, restarting the volume would be the simplest. Thanks From : Krishnan Parthasarathi <kparthas> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Wed, Aug 20, 2014 11:17 PM > > So, > Here's what appears to have happened. During heavy writes, for some > reason there was a connection freeze, and glusterfs client (which was > doing the write) seems to have detected a freeze and raised a ping > timeout, broke the connection and reconnected - all this is visible in > the logs. It appears that the "old connection" breakage hasn't > registered at the server side socket, and because of non-empty socket > buffer in the server, KEEPALIVE hasn't kicked in (which would otherwise We must be careful before we conclude that the KEEPALIVE mechanism didn't detect the 'closed' remote TCP endpoint. It is possible for the glusterfsd process (server) to have not 'attended' to the POLLERR that will be raised on KEEPALIVE time-out kicking in, simply because the 'epoll thread' was busy. > raise a disconnection). This is evidenced by inspecting the client-id of > the blocked lock and the acquired lock being equal. > ~KP The brick process is otherwise healthy and responding to other requests even now, so the epoll thread is not blocked for sure. The brick has two server_conn_t objects with the same ->client_id string. One of the server_conn_t is the frame->root->trans of the acquired lock. The other server_conn_t is the frame->root->trans of all the blocked locks. So it is clearly the case that a client has reconnected, but the server still sees two active connections. 
I can't think of any other reason than keepalive not kicking in for the server to still be holding on to the first/old connection. Thoughts? From : Diane Feddema <dfeddema> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela> Wed, Aug 20, 2014 11:47 PM Based on recent emails it looks like you've made a lot of progress debugging the stall-on-writes problem. Sounds like an interesting network issue with KEEPALIVE not detecting dead peers. The RHS iso installed on Cluster G is: RHSS-2.1.bd-20140502.n.0-RHS-x86_64-DVD1.iso This was the most current stable iso from the RHS group when we set up the cluster. From : Krishnan Parthasarathi <kparthas> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Thu, Aug 21, 2014 01:39 AM With your observations, it does seem that the keepalive timer is not kicking in. Meanwhile, I am exploring a way to 'dump', via statedump and per connection object on the server, the number of keepalive probes that have been fired, the number of probes that have been responded to by the remote endpoint, etc. This would allow us to assert rather than infer keepalive behaviour. From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Krishnan Parthasarathi <kparthas> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Thu, Aug 21, 2014 02:18 AM That would be a good diagnostic. I guess the larger point is that we need to move towards a more robust mechanism, like a server-initiated PING (only when there are outstanding frames or locks) which is transport independent (works over RDMA etc.) 
From : Krishnan Parthasarathi <kparthas> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Thu, Aug 21, 2014 02:28 AM The keepalive timer values are available in /proc/net/tcp{,6}. netstat --timers has more readable output. I haven't yet managed to find a per-socket interface for fetching the current state of keepalive timers. Without such an interface, it doesn't make sense for the glusterfs server to walk through all connections on the system and filter out the connections that aren't served by it. Yes, we need a heartbeat mechanism at the server side as well.
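For what it's worth, the per-socket keepalive *parameters* can be set and read back with setsockopt/getsockopt (Linux-specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT); a sketch, not the GlusterFS transport code, and note that this does not expose what KP is after -- the live timer state:

```python
import socket

def enable_keepalive(sock, idle=20, interval=2, count=9):
    """Enable TCP keepalive on sock with explicit probe timing (Linux-only options)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # seconds of idleness before the first probe is sent
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # seconds between successive probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # unanswered probes before the connection is reset
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

Also worth noting: keepalive probes are only sent on an otherwise idle connection. With unacknowledged data sitting in the socket buffer, TCP's retransmission timers govern instead, which is consistent with Avati's observation that the non-empty socket buffer kept KEEPALIVE from ever kicking in.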
hi Hank,
This bug seems very similar to 1089470. Use the following volume option as a workaround:

gluster volume set <volname> server.outstanding-rpc-limit 0

Copy-pasting the commit description for reference.

commit bc434b3ca961757ade8c6093f4ff0dbe4b3a4672
Author: Pranith Kumar K <pkarampu>
Date: Wed Apr 23 14:05:10 2014 +0530

    rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling

    Problem: When iozone is in progress, the number of blocking inodelks sometimes becomes greater than the threshold number of rpc requests allowed for that client (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT). Subsequent requests from that client will not be read until all the outstanding requests are processed and replied to. But because no more requests are read from that client, the unlocks on the already granted locks will never come, so the number of outstanding requests never comes down. This leads to a ping-timeout on the client.

    Fix: Do not account INODELK/ENTRYLK/LK for throttling

    Change-Id: I59c6b54e7ec24ed7375ff977e817a9cb73469806
    BUG: 1089470
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7531
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Anand Avati <avati>

Pranith
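The deadlock in that commit message can be sketched as a simple counting problem (a toy model, not the rpcsvc code): if granted-but-blocking lock requests count against the outstanding-RPC limit, the server stops reading from the client before the unlock that would release them ever arrives.

```python
def can_deadlock(limit, blocked_locks, locks_count_toward_limit):
    """Toy model of the rpcsvc throttle described above.

    A client holds `blocked_locks` blocking lock requests and wants to send
    an unlock. The server stops reading from the client once `limit`
    outstanding requests have accumulated.
    """
    outstanding = blocked_locks if locks_count_toward_limit else 0
    # The unlock request is only read if the client is under the limit;
    # at or over the limit, the unlock can never arrive -> deadlock.
    return outstanding >= limit

# Before the fix: 64 blocked INODELKs hit the default limit of 64 -> hang.
print(can_deadlock(64, 64, locks_count_toward_limit=True))    # True
# After the fix: lock requests are exempt from throttling -> unlock gets through.
print(can_deadlock(64, 64, locks_count_toward_limit=False))   # False
```

The workaround above (limit 0, i.e. unlimited) sidesteps the same deadlock from the other direction: no matter how many lock requests pile up, the server keeps reading from the client.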
Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the longest stalls. So, this certainly helps. I/O traffic is still somewhat erratic. We'll continue to investigate. It seems flow control with many concurrent write streams still needs to be tuned. Any suggestions? By the way, the documentation of some of these newer tuning parameters is weak to non-existent. What is the default value for server.outstanding-rpc-limit? Was it reduced in response to: http://review.gluster.org/#/c/6696/ ? Thanks for the quick response.
Created attachment 931081 [details] Test results changing server.outstanding-rpc-limit This table shows the impact of setting the RPC limit (server.outstanding-rpc-limit). The test runs on a 16-node gluster cluster. One brick per node, HW RAID6, replication=2. Standard POSIX calls, i.e., no Hadoop, HDFS interface or Java involved. On each node, 24 processes write 512MB files. The table shows elapsed time for the runs. For a large RPC limit, run times are fast and repeatable. For example, a run time of 60 sec would mean (16 x 24 x 2 x 512MB)/60sec = 6500 MB/sec data written to the bricks. As the RPC limit is reduced, run times become more erratic. Long stalls appear on individual nodes, different nodes on each run. Problems start to appear with an RPC limit of 64. With an RPC limit of 24 (i.e., one per file being written) run times become 3X longer than with a large RPC limit. With an RPC limit of 16, run times are 5X longer. Setting the RPC limit to 16 can cause severe performance problems for write-heavy workloads. These are common in Hadoop.
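The throughput arithmetic quoted above works out as follows (replication factor 2 doubles the bytes actually landing on the bricks):

```python
# Aggregate brick throughput for the 60-second example run described above.
nodes, procs_per_node, replicas = 16, 24, 2
file_mb, elapsed_s = 512, 60

brick_mb = nodes * procs_per_node * replicas * file_mb   # MB written to bricks
throughput = brick_mb / elapsed_s
print(f"{throughput:.0f} MB/s")   # ~6554 MB/s, the "6500 MB/sec" quoted above
```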
#hj> rpm -qa | egrep -i 'rhs|gluster' | sort
ctdb-1.0.114.6-1.el6rhs.x86_64
glusterfs-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.59rhs-1.el6rhs.x86_64
gluster-swift-1.10.0-2.el6rhs.noarch
gluster-swift-account-1.10.0-2.el6rhs.noarch
gluster-swift-container-1.10.0-2.el6rhs.noarch
gluster-swift-object-1.10.0-2.el6rhs.noarch
gluster-swift-plugin-1.10.0-5.el6rhs.noarch
gluster-swift-proxy-1.10.0-2.el6rhs.noarch
libsmbclient-3.6.9-167.10.el6rhs.x86_64
libsmbclient-devel-3.6.9-167.10.el6rhs.x86_64
python-eventlet-0.9.16-5.el6_rhs2.0.noarch
python-greenlet-0.3.1-6.el6_rhs2.0.x86_64
python-rhsm-1.8.17-1.el6_4.x86_64
redhat-storage-logos-60.0.17-1.el6rhs.noarch
redhat-storage-server-2.1.2.0-3.el6rhs.noarch
rhs-hadoop-2.1.6b2-1.noarch
rhs-hadoop-install-0_85-1.el6rhs.noarch
samba-3.6.9-167.10.el6rhs.x86_64
samba-client-3.6.9-167.10.el6rhs.x86_64
samba-common-3.6.9-167.10.el6rhs.x86_64
samba-doc-3.6.9-167.10.el6rhs.x86_64
samba-domainjoin-gui-3.6.9-167.10.el6rhs.x86_64
samba-glusterfs-3.6.9-167.10.el6rhs.x86_64
samba-swat-3.6.9-167.10.el6rhs.x86_64
samba-winbind-3.6.9-167.10.el6rhs.x86_64
samba-winbind-clients-3.6.9-167.10.el6rhs.x86_64
samba-winbind-devel-3.6.9-167.10.el6rhs.x86_64
samba-winbind-krb5-locator-3.6.9-167.10.el6rhs.x86_64
vdsm-4.13.0-24.el6rhs.x86_64
vdsm-cli-4.13.0-24.el6rhs.noarch
vdsm-gluster-4.13.0-24.el6rhs.noarch
vdsm-python-4.13.0-24.el6rhs.x86_64
vdsm-python-cpopen-4.13.0-24.el6rhs.x86_64
vdsm-reg-4.13.0-24.el6rhs.noarch
vdsm-xmlrpc-4.13.0-24.el6rhs.noarch
#hj> gluster volume info HadoopVol

Volume Name: HadoopVol
Type: Distributed-Replicate
Volume ID: 8cae7404-5bb2-4582-9eb1-ba1cf80b1797
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: host02-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick2: host04-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick3: host06-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick4: host08-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick5: host10-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick6: host12-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick7: host14-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick8: host16-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick9: host18-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick10: host20-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick11: host22-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick12: host24-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick13: host26-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick14: host28-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick15: host30-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick16: host32-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Options Reconfigured:
performance.write-behind: ON
server.outstanding-rpc-limit: 512
performance.io-thread-count: 32
performance.flush-behind: ON
performance.write-behind-window-size: 128MB
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.stat-prefetch: off
performance.quick-read: off
cluster.eager-lock: on
(In reply to Hank Jakiela from comment #3) > Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the > longest stalls. So, this certainly helps. > > I/O traffic is still somwhat erratic. We'll continue to investigate. It > seems flow control with many concurrent write streams still needs to be > tuned. Any suggestions? > > By the way, the documentation of some of these newer tuning parameters is > weak to non-existent. What is the default value for > server.outstanding-rpc-limit? Was it reduced in response to: > http://review.gluster.org/#/c/6696/ ? > > Thanks for the quick response. I will get this option documented. From source code this is the description we have: This option is a parameter to throttle the number of incoming RPC requests from a client. 0 means no limit (can potentially run out of memory) Default value: 64 minimum limit we can set is: 1 maximum limit we can set is: 65536 To set it to unlimited set the option value to: 0 16 is the default value for gluster NFS server. Are these tests being run on gluster nfs mount or gluster native mount (i.e. fuse)?
(In reply to Pranith Kumar K from comment #7) > (In reply to Hank Jakiela from comment #3) > > Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the > > longest stalls. So, this certainly helps. > > > > I/O traffic is still somwhat erratic. We'll continue to investigate. It > > seems flow control with many concurrent write streams still needs to be > > tuned. Any suggestions? > > > > By the way, the documentation of some of these newer tuning parameters is > > weak to non-existent. What is the default value for > > server.outstanding-rpc-limit? Was it reduced in response to: > > http://review.gluster.org/#/c/6696/ ? > > > > Thanks for the quick response. > > I will get this option documented. > From source code this is the description we have: > This option is a parameter to throttle the number of incoming RPC requests > from a client. 0 means no limit (can potentially run out of memory) > > Default value: 64 > minimum limit we can set is: 1 > maximum limit we can set is: 65536 > To set it to unlimited set the option value to: 0 > > 16 is the default value for gluster NFS server. Are these tests being run on > gluster nfs mount or gluster native mount (i.e. fuse)? HJ> We are not using NFS in this case. Just the FUSE mount.
(In reply to Hank Jakiela from comment #8) > (In reply to Pranith Kumar K from comment #7) > > (In reply to Hank Jakiela from comment #3) > > > Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the > > > longest stalls. So, this certainly helps. > > > > > > I/O traffic is still somwhat erratic. We'll continue to investigate. It > > > seems flow control with many concurrent write streams still needs to be > > > tuned. Any suggestions? > > > > > > By the way, the documentation of some of these newer tuning parameters is > > > weak to non-existent. What is the default value for > > > server.outstanding-rpc-limit? Was it reduced in response to: > > > http://review.gluster.org/#/c/6696/ ? > > > > > > Thanks for the quick response. > > > > I will get this option documented. > > From source code this is the description we have: > > This option is a parameter to throttle the number of incoming RPC requests > > from a client. 0 means no limit (can potentially run out of memory) > > > > Default value: 64 > > minimum limit we can set is: 1 > > maximum limit we can set is: 65536 > > To set it to unlimited set the option value to: 0 > > > > 16 is the default value for gluster NFS server. Are these tests being run on > > gluster nfs mount or gluster native mount (i.e. fuse)? > > HJ> We are not using NFS in this case. Just the FUSE mount. In that case no impact because of the patch http://review.gluster.org/#/c/6696/
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days