Red Hat Bugzilla – Bug 508749
GFS2 operations hung when one node is powered off while writing to fs
Last modified: 2010-06-11 09:32:07 EDT
Created attachment 349831 [details]
Description of problem:
GFS2 operations hang when a cluster node is powered off; processes enter D state (uninterruptible sleep).
Version-Release number of selected component (if applicable):
Steps to Reproduce:
1. start writing to fs from all cluster nodes
2. power off one node
3. any operation on the filesystem from the remaining nodes hangs. For example, ls .
Actual results: The operation does not return anything and waits forever.
Expected results: The operation completes as usual, returning the appropriate answer.
Created attachment 349832 [details]
program code to test writes/reads
This is the program used to inject load into the filesystem.
gcc -o s s.c
./s 10000 5000 4000
That writes to 10000 random directories, with 5000 files in each directory and up to 4000 bytes per file.
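The attachment itself is not reproduced here, but based on the description above, a minimal sketch of such a load generator might look like the following. The `run_load` function, the directory-naming scheme, and the filler bytes are assumptions for illustration, not the attached code:

```c
/* Hypothetical sketch of the core of the attached s.c (attachment
 * 349832); the real program may differ.  Creates `ndirs` randomly
 * named directories under the current directory, writes `nfiles`
 * files into each, filling every file with 1..maxbytes bytes.
 * Returns the number of files written, or -1 on error. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>

int run_load(long ndirs, long nfiles, long maxbytes)
{
    char dir[64], path[128], buf[4096];
    long written = 0;

    if (ndirs <= 0 || nfiles <= 0 || maxbytes <= 0)
        return -1;

    srand((unsigned)time(NULL));
    memset(buf, 'x', sizeof(buf));

    for (long d = 0; d < ndirs; d++) {
        /* random directory name; a collision (EEXIST) is harmless */
        snprintf(dir, sizeof(dir), "dir%08x", (unsigned)rand());
        mkdir(dir, 0755);
        for (long i = 0; i < nfiles; i++) {
            snprintf(path, sizeof(path), "%s/f%06ld", dir, i);
            FILE *f = fopen(path, "w");
            if (!f)
                return -1;
            long n = rand() % maxbytes + 1;   /* 1..maxbytes bytes */
            while (n > 0) {
                size_t chunk = n > (long)sizeof(buf) ? sizeof(buf)
                                                     : (size_t)n;
                fwrite(buf, 1, chunk, f);
                n -= (long)chunk;
            }
            fclose(f);
            written++;
        }
    }
    return (int)written;
}
```

In the real s.c this would be driven by a main() parsing the three command-line arguments, matching the `./s 10000 5000 4000` invocation above; running it from the GFS2 mountpoint on every node at once generates the contended create/write load.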
You can download an x86_64 kernel with the fix for bz #506140, to see if it's related, at
We've tried this kernel with the fix for bz #506140, and the behaviour is exactly the same: operations on the GFS2 filesystem still hang when one member of the cluster is powered off.
I've tried to reproduce using the program in comment #1, but I've not had success.
I've run the program on the gfs2 mountpoint from all 3 nodes of my cluster. On power-cycling one node, there's a brief pause of a few seconds but the other nodes resume work as expected.
I'm running slightly newer code. Can you post the 'modinfo gfs2' output and, if possible, the sysrq-t output ('echo t > /proc/sysrq-trigger', captured from the kernel log) when the processes hang?
Also, are you able to hit this easily with the steps highlighted above?
Please also look for fencing related messages in the logs of the nodes.
We encountered the same problem last week with a cluster of 18 nodes, when one of the nodes rebooted accidentally. The whole GFS2 filesystem hung.
Just a quick ping to see if there is any updated information relating to this bug yet?
Is there any more information regarding this issue? If not, we'll have to close this as INSUFFICIENT_DATA shortly.
Has the customer's system been checked to ensure that they don't have an old gfs2 kmod hanging around? It is just as possible that the issue relates to the cluster infrastructure as to gfs2; without further information there is no way to determine that.
Has the customer's cluster been brought within the supported size limits now?
Our customer is no longer actively maintaining this platform, so we won't be able to gather more information.
Maybe this can be kept open for the other 18/16-node cluster.
Thanks a lot for your help.
There isn't a lot of point keeping it open. We don't support cluster sizes of over 16 nodes, so if no more information about this is likely to appear, we should close it.
If something does crop up, then by all means reopen this bug and we'll look into it.