Bug 480013

Summary: GFS: httpd processes get hung in D state
Product: Red Hat Cluster Suite [Retired]
Component: GFS-kernel
Version: 4
Hardware: All
OS: Linux
Severity: high
Priority: low
Status: CLOSED INSUFFICIENT_DATA
Reporter: Balázs Miklós <mbalazs>
Assignee: Robert Peterson <rpeterso>
QA Contact: Cluster QE <mspqa-list>
CC: edamato, mbalazs, szivan
Last Closed: 2009-07-15 18:47:01 UTC

Description Balázs Miklós 2009-01-14 15:50:58 UTC
Description of problem:

We are experiencing GFS problems on a cluster hosting a high-traffic web
site. The servers are running RHEL 4.6, and the hosted site is a dynamic
PHP-based web application. GFS is used for the document root, for PHP
session files, and for storing pre-compiled caches of web page templates.

The issue we are having is that occasionally all of the httpd processes
on every node get hung in D (uninterruptible sleep) state. This normally
happens about once every two months, but it happened twice in two weeks
during the peak traffic period of the holidays. The only hint that
something went wrong with GFS is the following lines in the message log:

Jan  4 20:36:50 fe20 kernel: dlm: cache: cancel reply ret 0
Jan  4 20:36:50 fe20 kernel: lock_dlm: unlock sb_status 0 8,60 flags 0
Jan  4 20:36:50 fe20 kernel: dlm: cache: process_lockqueue_reply id 221e016b state 0

Dec 23 13:48:43 fe20 kernel: dlm: cache: cancel reply ret 0
Dec 23 13:48:43 fe20 kernel: lock_dlm: unlock sb_status 0 2,c3f859 flags 0
Dec 23 13:48:43 fe20 kernel: dlm: cache: process_lockqueue_reply id 3fda0191 state 0

These messages appeared on only one node. Other than the hung httpd
processes everything seemed normal: no fence action was taken, but the
only way to recover from this was to fence the fe20 node by hand. The
httpd processes could not be killed.
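
For reference, the stuck processes can be listed with standard ps output
showing the state and kernel wait channel; nothing GFS-specific here,
just a sketch:

ps axo pid,stat,wchan:30,comm | awk '$2 ~ /^D/'

Each httpd process stuck in uninterruptible sleep shows up in this list
together with the kernel function it is blocked in.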

Unfortunately we didn't have time for a deeper investigation of the
servers' condition during the hang, because it is a production system and
getting the site back online was the top priority. I suspect that these
messages alone might not be enough to find the problem. If you need any
additional information from the hung state, tell us and we'll make sure
it is collected the next time this happens (which may be in a couple of
weeks or so).

Versions of GFS and dlm:
GFS-6.1.15-1.x86_64
GFS-kernel-smp-2.6.9-72.2.0.2.x86_64
dlm-1.0.7-1.x86_64
dlm-kernel-smp-2.6.9-52.4.x86_64

Comment 1 Robert Peterson 2009-01-14 17:23:22 UTC
In order to analyze the hang, I need to get the following information
(example collection commands are sketched after the list):

1. A dump of all GFS "glocks" from all nodes in the cluster. Please see:

http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_gfs

2. A dump of all dlm locks from all nodes in the cluster.  Please see:

http://sources.redhat.com/cluster/wiki/FAQ/LockManager#lock_dump_dlm

3. Process call traces from all nodes in the cluster, from sysrq-t
   output. The easiest way to collect that is to run this command:

echo "t" > /proc/sysrq-trigger

   and capture/save the console output. Please be aware that triggering
   the sysrq-t output can cause the system to hang and become
   unresponsive. Therefore, other nodes may think the system is dead and
   try to fence it, so you may have to temporarily disable fencing in
   order to collect the complete output.
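
For reference, on RHEL 4 these collections usually boil down to
something like the commands below. The mount point is an example, and
the lockspace name is taken from the "dlm: cache:" messages above;
adjust both to your configuration:

# 1. GFS glock dump, once per mounted GFS filesystem:
gfs_tool lockdump /mnt/gfs > /tmp/glocks.$(hostname).txt

# 2. dlm lock dump, once per lockspace:
echo "cache" > /proc/cluster/dlm_locks
cat /proc/cluster/dlm_locks > /tmp/dlm.cache.$(hostname).txt

# 3. Process call traces (capture the console or syslog output):
echo "t" > /proc/sysrq-trigger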

Comment 2 Robert Peterson 2009-07-15 18:47:01 UTC
This bug has been in NEEDINFO for six months; closing as INSUFFICIENT_DATA.