Bug 508749 - GFS2 operations hung when one node is powered off while writing to fs
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: medium
Severity: high
Target Milestone: rc
Assigned To: Abhijith Das
QA Contact: Cluster QE
Blocks: 533192
Reported: 2009-06-29 13:43 EDT by Rafael Godínez Pérez
Modified: 2010-06-11 09:32 EDT
Doc Type: Bug Fix
Last Closed: 2010-06-11 09:32:07 EDT

Attachments
cat /sys/kernel/debug/gfs2/<fsname>/glocks (2.88 MB, text/plain)
2009-06-29 13:43 EDT, Rafael Godínez Pérez
program code to test writes/reads (2.62 KB, text/x-csrc)
2009-06-29 13:46 EDT, Rafael Godínez Pérez

Description Rafael Godínez Pérez 2009-06-29 13:43:52 EDT
Created attachment 349831 [details]
cat /sys/kernel/debug/gfs2/<fsname>/glocks
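(For reference, a glock dump like this is normally captured as below; this assumes debugfs is already mounted, which may not be the case on every system, and the output file name is just an example:)

mount -t debugfs none /sys/kernel/debug    (only if not already mounted)
cat /sys/kernel/debug/gfs2/<fsname>/glocks > glocks.txt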

Description of problem:
GFS2 operations hang when a cluster node is powered off; the affected processes enter the D (uninterruptible sleep) state.


Version-Release number of selected component (if applicable):
5.3

How reproducible:
Always

Steps to Reproduce:
1. Start writing to the fs from all cluster nodes
2. Power off one node
3. Any operation on the filesystem from the remaining nodes hangs. For example, ls.
  
Actual results:
The operation never returns; it waits forever.

Expected results:
The operation completes as usual, giving the appropriate result.

Additional info:
Comment 1 Rafael Godínez Pérez 2009-06-29 13:46:50 EDT
Created attachment 349832 [details]
program code to test writes/reads

This is the program used to inject load into the filesystem.

gcc -o s s.c
cd <filesystem>
./s 10000 5000 4000

That will write to 10000 random directories, creating 5000 files in each directory with up to 4000 bytes per file.
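
For anyone without access to the attachment, here is a minimal sketch of what a load generator like this could look like. This is not the attached s.c; the directory/file naming scheme, the use of rand(), and the lack of error handling are assumptions:

/* Hypothetical sketch, not the attached s.c: creates <ndirs> random
   directories with <nfiles> files of up to <maxbytes> bytes each. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    int ndirs, nfiles, maxbytes, d, f;
    char dpath[64], fpath[128], buf[4096];

    if (argc != 4) {
        fprintf(stderr, "usage: %s <ndirs> <nfiles> <maxbytes>\n", argv[0]);
        return 1;
    }
    ndirs = atoi(argv[1]);
    nfiles = atoi(argv[2]);
    maxbytes = atoi(argv[3]);
    memset(buf, 'x', sizeof(buf));

    for (d = 0; d < ndirs; d++) {
        /* pick a random directory so the load spreads across the fs */
        snprintf(dpath, sizeof(dpath), "dir%d", rand() % ndirs);
        mkdir(dpath, 0755);                /* EEXIST is fine, reuse it */
        for (f = 0; f < nfiles; f++) {
            FILE *fp;
            size_t n = (size_t)(rand() % maxbytes) + 1;
            if (n > sizeof(buf))
                n = sizeof(buf);
            snprintf(fpath, sizeof(fpath), "%s/file%d", dpath, f);
            fp = fopen(fpath, "w");
            if (!fp)
                continue;
            fwrite(buf, 1, n, fp);         /* up to maxbytes bytes per file */
            fclose(fp);
        }
    }
    return 0;
}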
Comment 2 Ben Marzinski 2009-06-29 22:57:09 EDT
You can download an x86_64 kernel with the fix for bz #506140, to see if it's related, at
http://people.redhat.com/~bmarzins/gfs2_kernel/kernel-2.6.18-152.el5.506140try2.x86_64.rpm
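
If it helps, a test kernel like this can usually be installed alongside the running one and selected at boot; the exact steps depend on your setup, roughly:

wget http://people.redhat.com/~bmarzins/gfs2_kernel/kernel-2.6.18-152.el5.506140try2.x86_64.rpm
rpm -ivh kernel-2.6.18-152.el5.506140try2.x86_64.rpm
(reboot and pick the 506140try2 kernel in the grub menu)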
Comment 3 Rafael Godínez Pérez 2009-06-30 13:23:38 EDT
We've tried the kernel with the fix for bz #506140, and the behaviour is exactly the same: operations on the GFS2 filesystem hang when one member of the cluster is powered off.
Comment 4 Abhijith Das 2009-06-30 15:13:48 EDT
I've tried to reproduce using the program in comment #1, but I've not had success.
I've run the program on the gfs2 mountpoint from all 3 nodes of my cluster. On power-cycling one node, there's a brief pause of a few seconds but the other nodes resume work as expected.

I'm running slightly newer code. Can you post the 'modinfo gfs2' output and possibly the sysrq-t output when the processes hang?

Also, are you able to hit this easily with the steps highlighted above?
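
In case it's useful, the usual way to capture those is as follows (this assumes the magic sysrq key can be enabled on the node; the output file names are just examples):

modinfo gfs2 > modinfo-gfs2.txt
echo 1 > /proc/sys/kernel/sysrq       (enable sysrq if it isn't already)
echo t > /proc/sysrq-trigger          (dump all task states to the kernel log)
dmesg > sysrq-t.txt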
Comment 5 Abhijith Das 2009-07-14 11:15:28 EDT
Please also look for fencing related messages in the logs of the nodes.
Comment 6 Richard Vigeant 2009-07-15 18:37:13 EDT
We encountered the same problem last week with a cluster of 18 nodes, when one of the nodes rebooted accidentally. The whole GFS2 filesystem was hung.
Comment 12 Steve Whitehouse 2010-04-21 07:15:50 EDT
Just a quick ping to see if there is any updated information relating to this bug yet?
Comment 13 Steve Whitehouse 2010-06-11 05:58:48 EDT
Is there any more information regarding this issue? If not we'll have to close as INSUFFICIENT_DATA shortly.

Has the customer's system been checked to ensure that they don't have an old gfs2 kmod hanging around? Without further information, it is just as possible that the issue relates to the cluster infrastructure as to gfs2.

Has the customer's cluster been brought within the supported size limits now?
Comment 14 Rafael Godínez Pérez 2010-06-11 09:19:30 EDT
Our customer is no longer actively maintaining this platform, and we won't be able to gather more information.
Maybe this can be kept open for the other 18-node cluster.
Thanks a lot for your help.
Comment 15 Steve Whitehouse 2010-06-11 09:32:07 EDT
There isn't a lot of point keeping it open. We don't support cluster sizes of over 16 nodes, so if no more information about this is likely to appear, then we should close it.

If something does turn up, then by all means reopen this bug and we'll look into it.
