Bug 508749

Summary: GFS2 operations hang when one node is powered off while writing to fs
Product: Red Hat Enterprise Linux 5
Component: kernel
Version: 5.3
Hardware: x86_64
OS: Linux
Status: CLOSED INSUFFICIENT_DATA
Severity: high
Priority: medium
Target Milestone: rc
Reporter: Rafael Godínez Pérez <rgodinez>
Assignee: Abhijith Das <adas>
QA Contact: Cluster QE <mspqa-list>
CC: adas, bmarzins, jwest, richard.vigeant, rpeterso, rwheeler, swhiteho, tao
Doc Type: Bug Fix
Bug Blocks: 533192
Last Closed: 2010-06-11 13:32:07 UTC

Attachments:
cat /sys/kernel/debug/gfs2/<fsname>/glocks
program code to test writes/reads

Description Rafael Godínez Pérez 2009-06-29 17:43:52 UTC
Created attachment 349831 [details]
cat /sys/kernel/debug/gfs2/<fsname>/glocks
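(Note: if /sys/kernel/debug appears empty, debugfs has to be mounted first, e.g. mount -t debugfs none /sys/kernel/debug.)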

Description of problem:
GFS2 operations hang when a cluster node is powered off; the affected processes enter D state (uninterruptible sleep).


Version-Release number of selected component (if applicable):
5.3

How reproducible:
Always

Steps to Reproduce:
1. Start writing to the fs from all cluster nodes.
2. Power off one node.
3. Any operation on the filesystem from the remaining nodes hangs, for example ls.
Actual results:
The operation never returns and waits forever.

Expected results:
The operation completes as usual, returning the appropriate result.

Additional info:

Comment 1 Rafael Godínez Pérez 2009-06-29 17:46:50 UTC
Created attachment 349832 [details]
program code to test writes/reads

This is the program used to inject load into the filesystem.

gcc -o s s.c
cd <filesystem>
./s 10000 5000 4000

That will create 10000 random directories and write 5000 files into each directory, with up to 4000 bytes per file.
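
For reference, a minimal sketch of what such a loader might look like; the attached s.c is the authoritative version, and the directory/file naming here is purely illustrative:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int ndirs, nfiles, maxbytes, d, f;
    char dir[32], path[64];
    char *buf;
    FILE *fp;

    if (argc != 4) {
        fprintf(stderr, "usage: %s <ndirs> <nfiles> <maxbytes>\n", argv[0]);
        return 1;
    }
    ndirs = atoi(argv[1]);
    nfiles = atoi(argv[2]);
    maxbytes = atoi(argv[3]);

    buf = malloc(maxbytes);
    memset(buf, 'x', maxbytes);
    srand(time(NULL) ^ getpid());   /* different stream per node/run */

    for (d = 0; d < ndirs; d++) {
        /* randomly named directory; EEXIST (e.g. from another node) is harmless */
        snprintf(dir, sizeof(dir), "d%08x", rand());
        mkdir(dir, 0755);
        for (f = 0; f < nfiles; f++) {
            snprintf(path, sizeof(path), "%s/f%06d", dir, f);
            fp = fopen(path, "w");
            if (!fp) { perror(path); return 1; }
            /* write between 1 and maxbytes bytes */
            fwrite(buf, 1, (size_t)(rand() % maxbytes) + 1, fp);
            fclose(fp);
        }
    }
    free(buf);
    return 0;
}

Compiled and run as above: gcc -o s s.c && ./s 10000 5000 4000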

Comment 2 Ben Marzinski 2009-06-30 02:57:09 UTC
You can download an x86_64 kernel with the fix for bz #506140, to see if it's related, at
http://people.redhat.com/~bmarzins/gfs2_kernel/kernel-2.6.18-152.el5.506140try2.x86_64.rpm
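(Assuming the usual procedure: install it alongside the running kernel with rpm -ivh kernel-2.6.18-152.el5.506140try2.x86_64.rpm and reboot into it.)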

Comment 3 Rafael Godínez Pérez 2009-06-30 17:23:38 UTC
We've tried this kernel with the fix for bz #506140, and the behaviour is exactly the same: operations on the GFS2 filesystem hang when one member of the cluster is powered off.

Comment 4 Abhijith Das 2009-06-30 19:13:48 UTC
I've tried to reproduce using the program in comment #1, but I've not had success.
I've run the program on the gfs2 mountpoint from all 3 nodes of my cluster. On power-cycling one node, there's a brief pause of a few seconds but the other nodes resume work as expected.

I'm running slightly newer code. Can you post the 'modinfo gfs2' output and possibly the sysrq-t output when the processes hang?
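(A sysrq-t dump can be triggered with echo t > /proc/sysrq-trigger, assuming the kernel.sysrq sysctl is enabled; the task list shows up in the kernel log.)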

Also, are you able to hit this easily with the steps highlighted above?

Comment 5 Abhijith Das 2009-07-14 15:15:28 UTC
Please also look for fencing related messages in the logs of the nodes.
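(For instance, something like grep -i fenc /var/log/messages on the surviving nodes should show whether fencing of the dead node actually completed.)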

Comment 6 Richard Vigeant 2009-07-15 22:37:13 UTC
We encountered the same problem last week with a cluster of 18 nodes, when one of the nodes rebooted accidentally. The whole GFS2 filesystem was hung.

Comment 12 Steve Whitehouse 2010-04-21 11:15:50 UTC
Just a quick ping to see if there is any updated information relating to this bug yet?

Comment 13 Steve Whitehouse 2010-06-11 09:58:48 UTC
Is there any more information regarding this issue? If not we'll have to close as INSUFFICIENT_DATA shortly.

Has the customer's system been checked to ensure that they don't have an old gfs2 kmod hanging around? Without any further information, it is just as possible that the issue relates to the cluster infrastructure as to gfs2.

Has the customer's cluster been brought within the supported size limits now?

Comment 14 Rafael Godínez Pérez 2010-06-11 13:19:30 UTC
Our customer is no longer actively maintaining this platform, and we won't be able to gather more information.
Maybe this can be kept open for the 18-node cluster from comment 6.
Thanks a lot for your help.

Comment 15 Steve Whitehouse 2010-06-11 13:32:07 UTC
There isn't a lot of point keeping it open. We don't support cluster sizes of over 16 nodes, so if no more information on this is likely to appear, we should close it.

If something does crop up, then by all means reopen this bug and we'll look into it.