Description of problem: We are investigating the performance of RHS for use with Hadoop clusters. We find that RHS performance is often poor and erratic under heavy I/O loads. Our main observation is that individual client nodes can stall for long periods of time, often as long as a minute, occasionally longer. During these stalls, write traffic from a client node stops while other client nodes continue to write. The last straggler client may continue to stall even though nothing else is happening in the cluster that could cause it to stop. Eventually, the write traffic resumes and the test completes -- most of the time. This trouble report documents a case where the stall did not clear itself, and where one copy of a file appears corrupted. Avati investigated the system while it was in this state. His observations are captured in the email chain below. We are continuing to study the performance-only problem, and will open a separate bugzilla entry for that.

Test cluster and load:
- 16-node test cluster in the Phoenix lab, "Cluster G" (host[02-32]-rack07)
- RHS iso installed: RHSS-2.1.bd-20140502.n.0-RHS-x86_64-DVD1.iso
- Each server has one brick, HW RAID6 (9+2)
- 10Gb/s Ethernet, all on one pair of switches
- In this test, each node runs 16 client processes. Each process writes a 1GB file.
- The test uses POSIX calls to open, write, fsync and close files. Hadoop, the HDFS interface and Java are not involved.
- These stalls appear on writes, not reads.
- These stalls appear when writing to Gluster, but not when writing to the individual bricks.

On 08/20/2014 02:28 PM, Hank Jakiela wrote: > > Avati, > > I've been digging into these long stalls we see under heavy loads. Now the cluster is stuck in an odd state, with what looks like some data corruption. I wonder if you could take a look. > > The current state is this: > > 1. One node is unable to access one of the files that it had previously written. > > 2. All other nodes in the cluster can see the file correctly. > > 3. 
The sick node can see other files correctly. However, ls with a wildcard hangs on that directory. > > 4. The two copies of the file are on the first two bricks, but one copy is the wrong size. > > If you have time to poke around on the node, it's 10.1.4.8. I'll text you the root password. > > Thanks > Hank > From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela> Cc : Diane Feddema <dfeddema> Wed, Aug 20, 2014 07:30 PM From where was the glusterfs RPM installed? I'm trying to install the debuginfo RPM of glusterfs and am unable to find it: # debuginfo-install glusterfs-3.4.0.59rhs-1.el6rhs.x86_64 ... No debuginfo packages available to install Here's what I could diagnose from state dumps thus far: It looks like there are a lot of internal locks which are blocked from the client host08-rack07 on the file wf-host08-rack07-29-000000 stored in host02-rack07. The locks are blocked because there is an unreleased lock on the same file from a client whose transport pointer address is 0xf2bcd0. I need the debuginfo RPMs so that I can easily find out the client-id string stored in the transport object, using gdb. Finding out which client owns the transport object (whose pointer is 0xf2bcd0) will let me progress to the next diagnosis step (on that node). I have a hunch it might be the self-heal daemon client; I'll try to confirm that in other ways in the meantime. From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt> Wed, Aug 20, 2014 08:23 PM Never mind, I could inspect the client id by taking hex dumps from the calculated offset of the field in the structure (off the pointer). It appears to be a connection from host08 process pid 27097, which at this point does not exist. I'm assuming it is some kind of ghost connection which has not yet been disconnected for whatever reason. Still diagnosing why we ended up in this situation. 
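For reference, the load on each client process amounts to something like the following -- a minimal sketch of the POSIX open/write/fsync/close pattern the test uses, not the actual test harness (the path, block size, and function name here are illustrative):

```python
import os

def write_stream(path, total_bytes=1 << 30, block=1 << 20):
    """Write total_bytes to path in block-sized chunks, then fsync and close.

    Mirrors the call pattern described above: open, write, fsync, close.
    """
    buf = b"\0" * block
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        written = 0
        while written < total_bytes:
            written += os.write(fd, buf)
        os.fsync(fd)  # a stalled client typically sits blocked in write or fsync
    finally:
        os.close(fd)
    return written
```

In the test, 16 such processes per node each write a 1GB file onto the FUSE mount; the stalls show up as one node's writers going silent while all the others keep streaming.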
From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt> Wed, Aug 20, 2014 08:25 PM And the reason only host08 is seeing this issue is that there are outstanding (blocked) write requests on this file held up in the write-behind queue. These held-up writes force stat() on the file to queue behind them, hence ls hangs when stat'ing the file. On the other servers the write-behind queue on the file is empty, and stat requests flow through smoothly. Just FYI. On 08/20/2014 06:35 PM, Hank Jakiela wrote: > > Yes, this is the only node that would have been doing anything with that particular file (nb, hostname in the filename). > > Any idea why the writes were blocked? > > We're seeing a lot of cases where many operations on a particular node get held up for long periods, up to 200 seconds or more, after which things start up again. This is the first example where things didn't start up again. > > Hank > From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Hank Jakiela <hjakiela> Cc : Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp>, Krishnan Parthasarathi <kparthas> Wed, Aug 20, 2014 09:44 PM So, here's what appears to have happened. During heavy writes, for some reason there was a connection freeze, and the glusterfs client (which was doing the write) seems to have detected the freeze, raised a ping timeout, broken the connection and reconnected -- all of this is visible in the logs. It appears that the "old connection" breakage hasn't registered at the server-side socket, and because of a non-empty socket buffer on the server, KEEPALIVE hasn't kicked in (which would otherwise raise a disconnection). This is evidenced by the client-id of the blocked lock and the acquired lock being equal. 
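The queueing behavior Avati describes can be pictured with a toy model (purely illustrative, not GlusterFS code): per file, metadata requests are ordered behind any writes already sitting in the write-behind queue, so a stuck write also blocks stat() -- and therefore ls -- on that file only.

```python
from collections import deque

class WriteBehindQueue:
    """Toy model: per-file FIFO of pending writes; stat() waits behind them."""

    def __init__(self):
        self.pending = deque()          # queued, unacknowledged write requests

    def queue_write(self, data):
        self.pending.append(data)

    def flush(self):                    # server acknowledged the writes
        self.pending.clear()

    def stat_would_block(self):
        # stat() is ordered behind queued writes to keep metadata consistent
        return len(self.pending) > 0

q = WriteBehindQueue()
q.queue_write(b"data")
print(q.stat_would_block())   # True: ls/stat hangs while writes are stuck
q.flush()
print(q.stat_would_block())   # False: stat flows through on healthy clients
```

This matches the symptom in the report: only the node holding blocked writes sees the hang, while every other client stats the same file without delay.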
I'm copying KP and Raghavendra, who are familiar with this problem/known behavior (to raise the priority of addressing this problem). For now you could either just restart the volume (easiest) or attempt a manual lock clear. If this is just a test cluster, restarting the volume would be the simplest. Thanks From : Krishnan Parthasarathi <kparthas> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Wed, Aug 20, 2014 11:17 PM > > So, > Here's what appears to have happened. During heavy writes, for some > reason there was a connection freeze, and glusterfs client (which was > doing the write) seems to have detected a freeze and raised a ping > timeout, broke the connection and reconnected - all this is visible in > the logs. It appears that the "old connection" breakage hasn't > registered at the server side socket, and because of non-empty socket > buffer in the server, KEEPALIVE hasn't kicked in (which would otherwise We must be careful before we conclude that the KEEPALIVE mechanism didn't detect the 'closed' remote TCP endpoint. It is possible for the glusterfsd process (server) to have not 'attended' to the POLLERR that will be raised on KEEPALIVE time-out kicking in, simply because the 'epoll thread' was busy. > raise a disconnection). This is evidenced by inspecting the client-id of > the blocked lock and the acquired lock being equal. > ~KP The brick process is otherwise healthy and responding to other requests even now, so the epoll thread is not blocked for sure. The brick has two server_conn_t objects with the same ->client_id string. One of the server_conn_t is the frame->root->trans of the acquired lock. The other server_conn_t is the frame->root->trans of all the blocked locks. So it is clearly the case that a client has reconnected, but the server still sees two active connections. 
I can't think of any other reason than keepalive not kicking in for the server to still be holding on to the first/old connection. Thoughts? From : Diane Feddema <dfeddema> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela> Wed, Aug 20, 2014 11:47 PM Based on recent emails it looks like you've made a lot of progress debugging the stall-on-writes problem. Sounds like an interesting network issue with KEEPALIVE not detecting dead peers. The RHS iso installed on Cluster G is: RHSS-2.1.bd-20140502.n.0-RHS-x86_64-DVD1.iso This was the most current stable iso from the RHS group when we set up the cluster. From : Krishnan Parthasarathi <kparthas> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Thu, Aug 21, 2014 01:39 AM With your observations, it does seem that the keepalive timer is not kicking in. Meanwhile, I am exploring a way to 'dump', via statedump and per connection object on the server, the number of keepalive probes that have been fired, the number of probes that have been responded to by the remote endpoint, etc. This would allow us to assert rather than infer keepalive behaviour. From : Anand Avati <avati> Subject : Re: Long gluster stalls during heavy write load To : Krishnan Parthasarathi <kparthas> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Thu, Aug 21, 2014 02:18 AM That would be a good diagnostic. I guess the larger point is that we need to move towards a more robust mechanism, like a server-initiated PING (only when there are outstanding frames or locks) which is transport independent (works over RDMA etc.) 
From : Krishnan Parthasarathi <kparthas> Subject : Re: Long gluster stalls during heavy write load To : Anand Avati <avati> Cc : Hank Jakiela <hjakiela>, Diane Feddema <dfeddema>, Stephen Watt <swatt>, Raghavendra Gowdappa <rgowdapp> Thu, Aug 21, 2014 02:28 AM The keepalive timer values are available in /proc/net/tcp{,6}. netstat --timers has more readable output. I haven't yet managed to find a per-socket interface for fetching the current state of keepalive timers. Without such an interface, it doesn't make sense for the glusterfs server to walk through all connections on the system and filter out the connections that aren't served by it. Yes, we need a heartbeat mechanism at the server side as well.
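For what it's worth, the per-socket keepalive *parameters* can be set and read back with setsockopt/getsockopt (Linux-specific TCP_KEEPIDLE/TCP_KEEPINTVL/TCP_KEEPCNT); a sketch, not the GlusterFS transport code, and note that this does not expose what KP is after -- the live timer state:

```python
import socket

def enable_keepalive(sock, idle=20, interval=2, count=9):
    """Enable TCP keepalive on sock with explicit probe timing (Linux-only options)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # seconds of idleness before the first probe is sent
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    # seconds between successive probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    # unanswered probes before the connection is reset
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(s)
print(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE))
s.close()
```

Also worth noting: keepalive probes are only sent on an otherwise idle connection. With unacknowledged data sitting in the socket buffer, TCP's retransmission timers govern instead, which is consistent with Avati's observation that the non-empty socket buffer kept KEEPALIVE from ever kicking in.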
hi Hank,
This bug seems very similar to 1089470. Use the following volume option as a workaround:

gluster volume set <volname> server.outstanding-rpc-limit 0

Copy-pasting the commit description for reference.

commit bc434b3ca961757ade8c6093f4ff0dbe4b3a4672
Author: Pranith Kumar K <pkarampu>
Date: Wed Apr 23 14:05:10 2014 +0530

    rpcsvc: Ignore INODELK/ENTRYLK/LK for throttling

    Problem: When iozone is in progress, the number of blocking inodelks sometimes becomes greater than the threshold number of rpc requests allowed for that client (RPCSVC_DEFAULT_OUTSTANDING_RPC_LIMIT). Subsequent requests from that client will not be read until all the outstanding requests are processed and replied to. But because no more requests are read from that client, the unlocks on the already granted locks will never come, so the number of outstanding requests never comes down. This leads to a ping-timeout on the client.

    Fix: Do not account INODELK/ENTRYLK/LK for throttling

    Change-Id: I59c6b54e7ec24ed7375ff977e817a9cb73469806
    BUG: 1089470
    Signed-off-by: Pranith Kumar K <pkarampu>
    Reviewed-on: http://review.gluster.org/7531
    Tested-by: Gluster Build System <jenkins.com>
    Reviewed-by: Krutika Dhananjay <kdhananj>
    Reviewed-by: Anand Avati <avati>

Pranith
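The deadlock in that commit message can be sketched as a simple counting problem (a toy model, not the rpcsvc code): if granted-but-blocking lock requests count against the outstanding-RPC limit, the server stops reading from the client before the unlock that would release them ever arrives.

```python
def can_deadlock(limit, blocked_locks, locks_count_toward_limit):
    """Toy model of the rpcsvc throttle described above.

    A client holds `blocked_locks` blocking lock requests and wants to send
    an unlock. The server stops reading from the client once `limit`
    outstanding requests have accumulated.
    """
    outstanding = blocked_locks if locks_count_toward_limit else 0
    # The unlock request is only read if the client is under the limit;
    # at or over the limit, the unlock can never arrive -> deadlock.
    return outstanding >= limit

# Before the fix: 64 blocked INODELKs hit the default limit of 64 -> hang.
print(can_deadlock(64, 64, locks_count_toward_limit=True))    # True
# After the fix: lock requests are exempt from throttling -> unlock gets through.
print(can_deadlock(64, 64, locks_count_toward_limit=False))   # False
```

The workaround above (limit 0, i.e. unlimited) sidesteps the same deadlock from the other direction: no matter how many lock requests pile up, the server keeps reading from the client.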
Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the longest stalls. So, this certainly helps. I/O traffic is still somewhat erratic. We'll continue to investigate. It seems flow control with many concurrent write streams still needs to be tuned. Any suggestions? By the way, the documentation of some of these newer tuning parameters is weak to non-existent. What is the default value for server.outstanding-rpc-limit? Was it reduced in response to: http://review.gluster.org/#/c/6696/ ? Thanks for the quick response.
Created attachment 931081 [details] Test results changing server.outstanding-rpc-limit This table shows the impact of setting the RPC limit (server.outstanding-rpc-limit). The test runs on a 16-node gluster cluster. One brick per node, HW RAID6, replication=2. Standard POSIX calls, i.e., no Hadoop, HDFS interface or Java involved. On each node, 24 processes write 512MB files. The table shows elapsed time for the runs. For a large RPC limit, run times are fast and repeatable. For example, a run time of 60 sec would mean (16 x 24 x 2 x 512MB)/60sec = 6500 MB/sec data written to the bricks. As the RPC limit is reduced, run times become more erratic. Long stalls appear on individual nodes, different nodes on each run. Problems start to appear with an RPC limit of 64. With an RPC limit of 24 (i.e., one per file being written) run times become 3X longer than with a large RPC limit. With an RPC limit of 16, run times are 5X longer. Setting the RPC limit to 16 can cause severe performance problems for write-heavy workloads. These are common in Hadoop.
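The throughput arithmetic quoted above works out as follows (replication factor 2 doubles the bytes actually landing on the bricks):

```python
# Aggregate brick throughput for the 60-second example run described above.
nodes, procs_per_node, replicas = 16, 24, 2
file_mb, elapsed_s = 512, 60

brick_mb = nodes * procs_per_node * replicas * file_mb   # MB written to bricks
throughput = brick_mb / elapsed_s
print(f"{throughput:.0f} MB/s")   # ~6554 MB/s, the "6500 MB/sec" quoted above
```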
#hj> rpm -qa | egrep -i 'rhs|gluster' | sort
ctdb-1.0.114.6-1.el6rhs.x86_64
glusterfs-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-api-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-fuse-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-geo-replication-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-libs-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-rdma-3.4.0.59rhs-1.el6rhs.x86_64
glusterfs-server-3.4.0.59rhs-1.el6rhs.x86_64
gluster-swift-1.10.0-2.el6rhs.noarch
gluster-swift-account-1.10.0-2.el6rhs.noarch
gluster-swift-container-1.10.0-2.el6rhs.noarch
gluster-swift-object-1.10.0-2.el6rhs.noarch
gluster-swift-plugin-1.10.0-5.el6rhs.noarch
gluster-swift-proxy-1.10.0-2.el6rhs.noarch
libsmbclient-3.6.9-167.10.el6rhs.x86_64
libsmbclient-devel-3.6.9-167.10.el6rhs.x86_64
python-eventlet-0.9.16-5.el6_rhs2.0.noarch
python-greenlet-0.3.1-6.el6_rhs2.0.x86_64
python-rhsm-1.8.17-1.el6_4.x86_64
redhat-storage-logos-60.0.17-1.el6rhs.noarch
redhat-storage-server-2.1.2.0-3.el6rhs.noarch
rhs-hadoop-2.1.6b2-1.noarch
rhs-hadoop-install-0_85-1.el6rhs.noarch
samba-3.6.9-167.10.el6rhs.x86_64
samba-client-3.6.9-167.10.el6rhs.x86_64
samba-common-3.6.9-167.10.el6rhs.x86_64
samba-doc-3.6.9-167.10.el6rhs.x86_64
samba-domainjoin-gui-3.6.9-167.10.el6rhs.x86_64
samba-glusterfs-3.6.9-167.10.el6rhs.x86_64
samba-swat-3.6.9-167.10.el6rhs.x86_64
samba-winbind-3.6.9-167.10.el6rhs.x86_64
samba-winbind-clients-3.6.9-167.10.el6rhs.x86_64
samba-winbind-devel-3.6.9-167.10.el6rhs.x86_64
samba-winbind-krb5-locator-3.6.9-167.10.el6rhs.x86_64
vdsm-4.13.0-24.el6rhs.x86_64
vdsm-cli-4.13.0-24.el6rhs.noarch
vdsm-gluster-4.13.0-24.el6rhs.noarch
vdsm-python-4.13.0-24.el6rhs.x86_64
vdsm-python-cpopen-4.13.0-24.el6rhs.x86_64
vdsm-reg-4.13.0-24.el6rhs.noarch
vdsm-xmlrpc-4.13.0-24.el6rhs.noarch
#hj> gluster volume info HadoopVol

Volume Name: HadoopVol
Type: Distributed-Replicate
Volume ID: 8cae7404-5bb2-4582-9eb1-ba1cf80b1797
Status: Started
Number of Bricks: 8 x 2 = 16
Transport-type: tcp
Bricks:
Brick1: host02-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick2: host04-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick3: host06-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick4: host08-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick5: host10-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick6: host12-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick7: host14-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick8: host16-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick9: host18-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick10: host20-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick11: host22-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick12: host24-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick13: host26-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick14: host28-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick15: host30-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Brick16: host32-rack07.scale.openstack.engineering.redhat.com:/mnt/brick1/HadoopVol
Options Reconfigured:
performance.write-behind: ON
server.outstanding-rpc-limit: 512
performance.io-thread-count: 32
performance.flush-behind: ON
performance.write-behind-window-size: 128MB
performance.cache-size: 1GB
performance.cache-refresh-timeout: 60
performance.stat-prefetch: off
performance.quick-read: off
cluster.eager-lock: on
(In reply to Hank Jakiela from comment #3) > Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the > longest stalls. So, this certainly helps. > > I/O traffic is still somwhat erratic. We'll continue to investigate. It > seems flow control with many concurrent write streams still needs to be > tuned. Any suggestions? > > By the way, the documentation of some of these newer tuning parameters is > weak to non-existent. What is the default value for > server.outstanding-rpc-limit? Was it reduced in response to: > http://review.gluster.org/#/c/6696/ ? > > Thanks for the quick response. I will get this option documented. From source code this is the description we have: This option is a parameter to throttle the number of incoming RPC requests from a client. 0 means no limit (can potentially run out of memory) Default value: 64 minimum limit we can set is: 1 maximum limit we can set is: 65536 To set it to unlimited set the option value to: 0 16 is the default value for gluster NFS server. Are these tests being run on gluster nfs mount or gluster native mount (i.e. fuse)?
(In reply to Pranith Kumar K from comment #7) > (In reply to Hank Jakiela from comment #3) > > Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the > > longest stalls. So, this certainly helps. > > > > I/O traffic is still somwhat erratic. We'll continue to investigate. It > > seems flow control with many concurrent write streams still needs to be > > tuned. Any suggestions? > > > > By the way, the documentation of some of these newer tuning parameters is > > weak to non-existent. What is the default value for > > server.outstanding-rpc-limit? Was it reduced in response to: > > http://review.gluster.org/#/c/6696/ ? > > > > Thanks for the quick response. > > I will get this option documented. > From source code this is the description we have: > This option is a parameter to throttle the number of incoming RPC requests > from a client. 0 means no limit (can potentially run out of memory) > > Default value: 64 > minimum limit we can set is: 1 > maximum limit we can set is: 65536 > To set it to unlimited set the option value to: 0 > > 16 is the default value for gluster NFS server. Are these tests being run on > gluster nfs mount or gluster native mount (i.e. fuse)? HJ> We are not using NFS in this case. Just the FUSE mount.
(In reply to Hank Jakiela from comment #8) > (In reply to Pranith Kumar K from comment #7) > > (In reply to Hank Jakiela from comment #3) > > > Setting server.outstanding-rpc-limit to zero (or 512) seems to eliminate the > > > longest stalls. So, this certainly helps. > > > > > > I/O traffic is still somwhat erratic. We'll continue to investigate. It > > > seems flow control with many concurrent write streams still needs to be > > > tuned. Any suggestions? > > > > > > By the way, the documentation of some of these newer tuning parameters is > > > weak to non-existent. What is the default value for > > > server.outstanding-rpc-limit? Was it reduced in response to: > > > http://review.gluster.org/#/c/6696/ ? > > > > > > Thanks for the quick response. > > > > I will get this option documented. > > From source code this is the description we have: > > This option is a parameter to throttle the number of incoming RPC requests > > from a client. 0 means no limit (can potentially run out of memory) > > > > Default value: 64 > > minimum limit we can set is: 1 > > maximum limit we can set is: 65536 > > To set it to unlimited set the option value to: 0 > > > > 16 is the default value for gluster NFS server. Are these tests being run on > > gluster nfs mount or gluster native mount (i.e. fuse)? > > HJ> We are not using NFS in this case. Just the FUSE mount. In that case no impact because of the patch http://review.gluster.org/#/c/6696/
The needinfo request[s] on this closed bug have been removed as they have been unresolved for 1000 days