Bug 832609 - Glusterfsd hangs if brick filesystem becomes unresponsive, causing all clients to lock up
Status: CLOSED DEFERRED
Product: GlusterFS
Classification: Community
Component: posix
Version: 3.1.7
Hardware: Unspecified Unspecified
Priority: high  Severity: urgent
Assigned To: Raghavendra Bhat
Keywords: Triaged
Duplicates: 764756
Depends On:
Blocks: 852578
Reported: 2012-06-15 18:30 EDT by Louis Zuckerman
Modified: 2015-05-26 08:33 EDT

Doc Type: Bug Fix
Clones: 852578
Last Closed: 2014-12-14 14:40:33 EST
Type: Bug


Attachments
script for checking backend hang killing brick processes (1.21 KB, application/x-shellscript)
2013-01-23 03:52 EST, Raghavendra Bhat
script for checking backend hang killing brick processes (modified) (1.96 KB, application/x-shellscript)
2013-05-08 05:45 EDT, Raghavendra Bhat
Description Louis Zuckerman 2012-06-15 18:30:56 EDT
Description of problem:

If a brick filesystem stops responding and operations block for a "long" period of time during which no operations complete, then glusterfsd should release its socket and exit.

The problem now is that it simply hangs, waiting forever for operations to complete.  This causes two problems:

1. The whole volume freezes because clients will try to access this hung brick.  It doesn't matter if the volume has replication, because the bad brick is never detected as failing.

2. Killing the glusterfsd process (which allows clients to continue with the good replica) then causes the glusterfsd process to become a zombie and keep holding its TCP socket.  This means that a new glusterfsd process cannot start, because it cannot acquire the TCP socket for that brick's port number.  So the only way to restore that brick to service is to reboot the server.

Version-Release number of selected component (if applicable):


How reproducible:

Should be easily reproducible but I'm not sure exactly how.  In my case the problem was the Amazon EC2 outage of June 15 2012 in the Northern Virginia (US-EAST-1) region.

Amazon failed an EBS volume that was mounted and in use as a glusterfs brick.  Amazon froze all access to that disk though it was still "attached" to my instance, so the operating system did not detect any change in hardware.  Operations simply hung on the device, forever.

Actual results:

Glusterfsd waits forever for operations to complete on the brick filesystem.  This means clients don't know there's a problem so they don't stop using that brick.

Killing glusterfsd causes it to become a zombie and hold on to the TCP socket, which prevents an admin from restarting the process until a reboot.

Expected results:

Glusterfsd exits if brick filesystem operations have been pending for a "long" time and no operations have succeeded during that time.  Once glusterfsd exits clients will stop using that brick & continue to work with the remaining replica(s) still in service.  Glusterfsd should also close its socket before exiting so an admin can start a replacement glusterfsd process once the filesystem has been fixed.

Thank you very much.
Comment 1 Louis Zuckerman 2012-06-15 18:37:32 EDT
Maybe the criteria should just be "if operations have been pending for a 'long' time," without the added constraint that "no other operations have completed in that time."

If a brick is degraded, less than 100% healthy, I would rather see it taken out of service completely than kept around longer.  Fail fast.
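
The two candidate criteria (the one in the description and the relaxed one in this comment) can be compared with a small sketch. This is illustrative Python, not glusterfsd code: the class name PendingOpWatchdog, its methods, and the require_no_progress switch are all hypothetical, and the 30-second value used in the example is only a placeholder for "long".

```python
import time


class PendingOpWatchdog:
    """Hypothetical tracker for in-flight brick operations (not GlusterFS code)."""

    def __init__(self, timeout_s, require_no_progress=True, clock=time.monotonic):
        self.timeout_s = timeout_s
        # True: the stricter rule from the description (old pending op AND
        # no completions for timeout_s). False: the relaxed "fail fast" rule.
        self.require_no_progress = require_no_progress
        self.clock = clock
        self.pending = {}                 # op id -> time the op was issued
        self.last_completion = clock()    # time of the most recent completion

    def start(self, op_id):
        self.pending[op_id] = self.clock()

    def finish(self, op_id):
        self.pending.pop(op_id, None)
        self.last_completion = self.clock()

    def should_fail(self):
        """Return True when the brick process should give up and exit."""
        if not self.pending:
            return False
        now = self.clock()
        oldest = min(self.pending.values())
        if now - oldest < self.timeout_s:
            return False
        if self.require_no_progress:
            return now - self.last_completion >= self.timeout_s
        return True
```

With a 30-second timeout, an operation stuck for 31 seconds trips the relaxed rule immediately, while the stricter rule keeps waiting as long as some other operation completed within the last 30 seconds.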
Comment 2 Joe Julian 2012-06-15 18:43:06 EDT
I've had the same problem. Others have as well and filed a bug report, but it was closed as wontfix. This should be a high priority issue as it does stop all access to that volume indefinitely.
Comment 3 Louis Zuckerman 2012-06-16 07:09:52 EDT
Please also backport the fix to the 3.1 and 3.2 release branches and issue new patch releases of both.
Comment 4 Rudi Meyer 2012-07-09 10:01:41 EDT
I can reproduce exactly this on AWS EC2 by force-detaching an EBS volume.
Comment 5 Amar Tumballi 2012-09-04 04:47:12 EDT
Sorry about the delay in working on this. Will take this up soon.
Comment 6 Anthony DeChiaro 2012-10-11 10:35:52 EDT
I'm seeing a similar issue on my end as well...  in my case one of the glusterfsd processes jumps up to 100% and gluster reports the brick as not connected.  I don't think I have issues with the entire volume though, as the other bricks seem to be functioning fine.  I'm running a distributed replicate volume and the other replicate (of the disconnected brick) receives data just fine.  A 'volume heal info' just shows an increasing number of entries for the disconnected brick.

Problem #2 I'm seeing on my end as well.  If I stop the glusterd service, the "broken" glusterfsd process remains behind and killing it creates a zombie as mentioned previously.  Also, I only seem to notice this problem under load.
Comment 7 Anthony DeChiaro 2012-10-11 13:43:24 EDT
I'd also like to note on my end a "gluster volume status" shows the brick in question as online, but the "gluster volume heal info" shows it as not connected.
Comment 8 JMW 2012-11-19 14:28:01 EST
Is this targeted for any particular milestone or tree?
Comment 9 Raghavendra Bhat 2012-12-04 05:10:01 EST
*** Bug 764756 has been marked as a duplicate of this bug. ***
Comment 10 Amar Tumballi 2012-12-26 01:09:47 EST
Sorry for the delay in updating the bug. We have been thinking about this bug, but it is not very simple, mainly because we would not get any 'notification' from the backend about the hang. If we ever got such a notification, it would be easy to trigger an 'exit()' from the process itself in such cases (thus closing all the socket connections too).

We could work around this by reducing the 'frame-timeout' option, but that won't solve the problem completely. One idea for solving this is to run a script (one per brick process) that keeps doing a 'df -h ${brick-dir}' in the background; if it doesn't get a reply within '${brick-timeout}' (configurable, with a default of 30 seconds for now, unless someone wants 42), it kills the brick process.

Let me know if that is fine.
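
A rough sketch of the polling idea described above, written in Python for illustration. It is not the script attached later in this bug; probe_responds and watch_brick are hypothetical names, and the 5-second polling interval is an assumption. One detail matters for this failure mode: a probe stuck in an uninterruptible system call may never be reaped, so the sketch polls the child instead of waiting on it. Whether 'df' actually blocks on a hung backend depends on how the backend fails.

```python
import os
import signal
import subprocess
import time


def probe_responds(cmd, timeout_s, poll_s=0.1):
    """Return True if cmd exits within timeout_s seconds, else False.

    The child is polled rather than waited on, because a probe blocked on a
    hung filesystem may ignore signals and never exit.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return True
        time.sleep(poll_s)
    proc.kill()  # best effort only; deliberately not followed by wait()
    return False


def watch_brick(probe_cmd, brick_pid, brick_timeout_s=30, interval_s=5):
    """Run probe_cmd repeatedly; SIGKILL brick_pid once a probe times out."""
    while probe_responds(probe_cmd, brick_timeout_s):
        time.sleep(interval_s)
    os.kill(brick_pid, signal.SIGKILL)


# Example (hypothetical brick path and pid):
#   watch_brick(["df", "-h", "/export/brick1"], brick_pid=12345)
```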
Comment 11 Raghavendra Bhat 2013-01-23 03:52:31 EST
Created attachment 685710
script for checking backend hang killing brick processes

Please execute the attached script when the problem arises and see if it works for you.
Comment 12 Amar Tumballi 2013-02-15 06:48:47 EST
This bug needs verification of the script attached in comment #11, as it seems to solve the issue for us. Once verified, we will move it to the appropriate status.
Comment 13 Louis Zuckerman 2013-02-15 19:58:15 EST
I tried but was not able to verify this solution.  Here's the test procedure I followed...

1. Launched EC2 instance with two EBS volumes attached
2. Installed glusterfs and created a replica 2 volume with the two local bricks
3. Mounted the volume locally
4. Started screen with two windows, one writing the date to a file on the client every second, the other with tail -f on the date file
5. In another session, started the script from comment #11
6. Did a "force detach" of one of the two EBS volumes (a brick)

At this point I observed the writer & reader hang.  The script however did not kill the bad brick process.

For some strange reason, after force detaching the EBS volume, df still returned quickly -- with information about the detached brick's mount point.  I could even still ls the detached brick's mount point and see the date file.  When I tried to tail the date file on the detached brick though the tail hung.

I did a kill and a kill -9 on the detached brick's glusterfsd process which sent it into zombie/defunct state.  After this the client did detect a ping timeout and continued with the remaining brick.

--- Last line from the brick log showing that I killed it ---

[2013-02-16 00:37:13.664909] W [glusterfsd.c:831:cleanup_and_exit] (-->/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f17449bacbd] (-->/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f1744c8de9a] (-->/usr/sbin/glusterfsd(glusterfs_sigwaiter+0xd5) [0x7f1745773a55]))) 0-: received signum (15), shutting down

--- Client log recognizing the brick process died ---

[2013-02-16 00:37:58.800634] C [client-handshake.c:126:rpc_client_ping_timer_expired] 0-foo-client-1: server 10.70.7.170:24010 has not responded in the last 42 seconds, disconnecting.

---

Notice that the brick log reports signal 15 at :13 seconds, but the client detected ping timeout 45 seconds later at :58.  That extra 3 seconds may have been how long it took me to check ps again after the first kill, and kill a second time with -9, but I'm not sure about that.
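
Spelled out, using the timestamps copied from the two log lines above and the 42-second ping timeout reported in the client log (a small Python check, added only to make the arithmetic explicit):

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S.%f"

# Timestamps copied from the brick log and the client log above.
killed = datetime.strptime("2013-02-16 00:37:13.664909", FMT)
timed_out = datetime.strptime("2013-02-16 00:37:58.800634", FMT)

ping_timeout_s = 42  # value reported in the client log line

gap = (timed_out - killed).total_seconds()
extra = gap - ping_timeout_s

print(f"gap between signal 15 and disconnect: {gap:.3f} s")    # 45.136 s
print(f"time beyond the ping timeout:         {extra:.3f} s")  # 3.136 s
```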

Thanks & I hope this feedback is helpful.
Comment 14 Louis Zuckerman 2013-02-15 20:05:36 EST
That test in previous comment was using glusterfs 3.3.1 on ubuntu precise (12.04) by the way.
Comment 15 Amar Tumballi 2013-02-17 23:06:46 EST
Thanks Louis :-)

That one gave a much better understanding of the problem. It seems we have to check for a read() [cat $filename] hang, and not a statfs() [df -h $brick] hang.  Will try to improve the script.

Meanwhile, the process goes into the zombie state because a system call issued by the process has not yet completed, and even kill -15 doesn't kill it all the time. We will also debug why, even after the main thread was closed, it took the 'ping-timeout' interval to disconnect.
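
A minimal sketch of a read()-based probe, in Python for illustration; it is not the modified script attached later in this bug. The name read_responds is hypothetical, and the choice of which file on the brick to read is left open. It reads one byte in a child process and polls the child with a deadline, since a reader blocked on a hung device may never exit. One caveat to keep in mind: a read that is served from the page cache can succeed even when the device is hung, so a real check would have to avoid cached data somehow.

```python
import subprocess
import sys
import time

# Child program: open the file named by argv[1] and read a single byte.
READ_ONE_BYTE = "import sys; open(sys.argv[1], 'rb').read(1)"


def read_responds(path, timeout_s, poll_s=0.1):
    """Return True if one byte can be read from path within timeout_s seconds.

    Returns False when the read hangs past the deadline or fails outright
    (for example because the file does not exist).
    """
    proc = subprocess.Popen([sys.executable, "-c", READ_ONE_BYTE, path],
                            stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        rc = proc.poll()
        if rc is not None:
            return rc == 0
        time.sleep(poll_s)
    proc.kill()  # best effort; a reader stuck in the kernel may not die
    return False
```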
Comment 16 hans 2013-04-08 08:06:34 EDT
I ran into this bug with 3.3.1 when I started a replace-brick causing the read-from brick to be overloaded with read IOPS (about 2M directories in the volume). The entire gluster volume was inaccessible for several minutes.

Now //sbin/glusterfs -f/var/lib/glusterd/vols/vol01/rb_dst_brick.vol on the target node is eating up 100% CPU, using 0 IOPS and showing no strace output. At least the volume is accessible again, but the replace-brick is stuck.
Comment 17 Rudi Meyer 2013-04-29 07:59:06 EDT
If trying to reproduce the problem in an Amazon Web Services setup using EBS disks, one should be aware of the somewhat related problem of failure handling of EBS disks on newer kernels: https://forums.aws.amazon.com/thread.jspa?threadID=110756
Comment 18 Raghavendra Bhat 2013-05-08 05:45:58 EDT
Created attachment 745151
script for checking backend hang killing brick processes (modified)

I have modified the previous script a bit. Please check whether this works or not.
Comment 19 Niels de Vos 2013-07-11 06:00:17 EDT
Is this a duplicate of Bug 971774?
Comment 20 Niels de Vos 2014-11-27 09:54:56 EST
The version that this bug was reported against no longer gets any updates from the Gluster Community. Please verify whether this report is still valid against a current (3.4, 3.5 or 3.6) release and update the version, or close this bug.

If there has been no update before 9 December 2014, this bug will be closed automatically.
Comment 21 Niels de Vos 2015-05-26 08:33:37 EDT
This bug has been CLOSED, and there has not been a response to the requested NEEDINFO in more than 4 weeks. The NEEDINFO flag is now getting cleared so that our Bugzilla household is getting more in order.