Bug 508749 - GFS2 operations hung when one node is powered off while writing to fs
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.3
Hardware: x86_64 Linux
Priority: medium
Severity: high
Target Milestone: rc
Assigned To: Abhijith Das
QA Contact: Cluster QE
Blocks: 533192
Reported: 2009-06-29 13:43 EDT by Rafael Godínez Pérez
Modified: 2010-06-11 09:32 EDT
Doc Type: Bug Fix
Last Closed: 2010-06-11 09:32:07 EDT

Attachments
cat /sys/kernel/debug/gfs2/<fsname>/glocks (2.88 MB, text/plain)
2009-06-29 13:43 EDT, Rafael Godínez Pérez
program code to test writes/reads (2.62 KB, text/x-csrc)
2009-06-29 13:46 EDT, Rafael Godínez Pérez

Description Rafael Godínez Pérez 2009-06-29 13:43:52 EDT
Created attachment 349831 [details]
cat /sys/kernel/debug/gfs2/<fsname>/glocks
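(For reference, a glock dump like this is normally captured as below; this assumes debugfs is already mounted, which may not be the case on every system, and the output file name is just an example:)

mount -t debugfs none /sys/kernel/debug    (only if not already mounted)
cat /sys/kernel/debug/gfs2/<fsname>/glocks > glocks.txt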

Description of problem:
GFS2 operations hang when a cluster node is powered off; the affected processes enter the D (uninterruptible sleep) state.


Version-Release number of selected component (if applicable):
5.3

How reproducible:
Always

Steps to Reproduce:
1. Start writing to the fs from all cluster nodes
2. Power off one node
3. Any operation on the filesystem from the remaining nodes hangs. For example, ls.
  
Actual results:
The operation never returns; it waits forever.

Expected results:
The operation completes as usual, giving the appropriate result.

Additional info:
Comment 1 Rafael Godínez Pérez 2009-06-29 13:46:50 EDT
Created attachment 349832 [details]
program code to test writes/reads

This is the program used to inject load into the filesystem.

gcc -o s s.c
cd <filesystem>
./s 10000 5000 4000

That will write to 10000 random directories, creating 5000 files in each directory with up to 4000 bytes per file.
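
For anyone without access to the attachment, here is a minimal sketch of what a load generator like this could look like. This is not the attached s.c; the directory/file naming scheme, the use of rand(), and the lack of error handling are assumptions:

/* Hypothetical sketch, not the attached s.c: creates <ndirs> random
   directories with <nfiles> files of up to <maxbytes> bytes each. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
    int ndirs, nfiles, maxbytes, d, f;
    char dpath[64], fpath[128], buf[4096];

    if (argc != 4) {
        fprintf(stderr, "usage: %s <ndirs> <nfiles> <maxbytes>\n", argv[0]);
        return 1;
    }
    ndirs = atoi(argv[1]);
    nfiles = atoi(argv[2]);
    maxbytes = atoi(argv[3]);
    memset(buf, 'x', sizeof(buf));

    for (d = 0; d < ndirs; d++) {
        /* pick a random directory so the load spreads across the fs */
        snprintf(dpath, sizeof(dpath), "dir%d", rand() % ndirs);
        mkdir(dpath, 0755);                /* EEXIST is fine, reuse it */
        for (f = 0; f < nfiles; f++) {
            FILE *fp;
            size_t n = (size_t)(rand() % maxbytes) + 1;
            if (n > sizeof(buf))
                n = sizeof(buf);
            snprintf(fpath, sizeof(fpath), "%s/file%d", dpath, f);
            fp = fopen(fpath, "w");
            if (!fp)
                continue;
            fwrite(buf, 1, n, fp);         /* up to maxbytes bytes per file */
            fclose(fp);
        }
    }
    return 0;
}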
Comment 2 Ben Marzinski 2009-06-29 22:57:09 EDT
You can download an x86_64 kernel with the fix for bz #506140, to see if it's related, at
http://people.redhat.com/~bmarzins/gfs2_kernel/kernel-2.6.18-152.el5.506140try2.x86_64.rpm
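
If it helps, a test kernel like this can usually be installed alongside the running one and selected at boot; the exact steps depend on your setup, roughly:

wget http://people.redhat.com/~bmarzins/gfs2_kernel/kernel-2.6.18-152.el5.506140try2.x86_64.rpm
rpm -ivh kernel-2.6.18-152.el5.506140try2.x86_64.rpm
(reboot and pick the 506140try2 kernel in the grub menu)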
Comment 3 Rafael Godínez Pérez 2009-06-30 13:23:38 EDT
We've tried the kernel with the fix for bz #506140, and the behaviour is exactly the same: operations on the GFS2 filesystem hang when one member of the cluster is powered off.
Comment 4 Abhijith Das 2009-06-30 15:13:48 EDT
I've tried to reproduce using the program in comment #1, but I've not had success.
I've run the program on the gfs2 mountpoint from all 3 nodes of my cluster. On power-cycling one node, there's a brief pause of a few seconds but the other nodes resume work as expected.

I'm running slightly newer code. Can you post the 'modinfo gfs2' output and possibly the sysrq-t output when the processes hang?

Also, are you able to hit this easily with the steps highlighted above?
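
In case it's useful, the usual way to capture those is as follows (this assumes the magic sysrq key can be enabled on the node; the output file names are just examples):

modinfo gfs2 > modinfo-gfs2.txt
echo 1 > /proc/sys/kernel/sysrq       (enable sysrq if it isn't already)
echo t > /proc/sysrq-trigger          (dump all task states to the kernel log)
dmesg > sysrq-t.txt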
Comment 5 Abhijith Das 2009-07-14 11:15:28 EDT
Please also look for fencing related messages in the logs of the nodes.
Comment 6 Richard Vigeant 2009-07-15 18:37:13 EDT
We encountered the same problem last week with a cluster of 18 nodes, when one of the nodes rebooted accidentally. The whole GFS2 filesystem was hung.
Comment 12 Steve Whitehouse 2010-04-21 07:15:50 EDT
Just a quick ping to see if there is any updated information relating to this bug yet?
Comment 13 Steve Whitehouse 2010-06-11 05:58:48 EDT
Is there any more information regarding this issue? If not we'll have to close as INSUFFICIENT_DATA shortly.

Has the customer's system been checked to ensure that they don't have an old gfs2 kmod hanging around? Without further information, it is just as possible that the issue relates to the cluster infrastructure as to gfs2.

Has the customer's cluster been brought within the supported size limits now?
Comment 14 Rafael Godínez Pérez 2010-06-11 09:19:30 EDT
Our customer is no longer actively maintaining this platform, and we won't be able to gather more information.
Maybe this can be kept open for the other 18-node cluster.
Thanks a lot for your help.
Comment 15 Steve Whitehouse 2010-06-11 09:32:07 EDT
There isn't a lot of point keeping it open. We don't support cluster sizes of over 16 nodes, so if no more information about this is likely to appear, then we should close it.

If something does turn up, then by all means reopen this bug and we'll look into it.
