Red Hat Bugzilla – Bug 480013
GFS: httpd processes hang in D state
Last modified: 2010-01-11 22:21:46 EST
Description of problem:
We are experiencing GFS problems on a cluster hosting a high-traffic web
site. The servers are running RHEL 4.6, the hosted site is a dynamic PHP
based web application. GFS is used for the document root, for PHP
session files and for storing pre-compiled caches of web page templates.
The issue we are having is that occasionally all of the httpd processes
on every node hang in D state. Most of the time this happens about once
every two months, but, for example, it happened twice in two weeks
during the peak traffic period of the holidays. The only hint that
something went wrong with GFS is the following lines in the message log:
Jan 4 20:36:50 fe20 kernel: dlm: cache: cancel reply ret 0
Jan 4 20:36:50 fe20 kernel: lock_dlm: unlock sb_status 0 8,60 flags 0
Jan 4 20:36:50 fe20 kernel: dlm: cache: process_lockqueue_reply id
221e016b state 0
Dec 23 13:48:43 fe20 kernel: dlm: cache: cancel reply ret 0
Dec 23 13:48:43 fe20 kernel: lock_dlm: unlock sb_status 0 2,c3f859 flags
Dec 23 13:48:43 fe20 kernel: dlm: cache: process_lockqueue_reply id
3fda0191 state 0
These messages have only appeared on one node. Other than the hung
apache processes, everything seemed to be normal and no fence action was
taken, but the only way to recover from this was to fence the fe20 node
by hand. The apache processes could not be killed.
Unfortunately we didn't have time for a deeper investigation of the
servers' condition during the hang, because it is a production system
and getting the site back online was the top priority. I suspect that
these messages might not be enough to find the problem... If you need
any additional information from the hung state, tell us and we'll make
sure it is collected the next time this happens (which could be in a
couple of weeks or so).
Versions of GFS and dlm:
In order to analyze the hang, I need to get the following information:
1. A dump of all GFS "glocks" from all nodes in the cluster. Please see:
2. A dump of all dlm locks from all nodes in the cluster. Please see:
3. Process call trace output from all nodes in the cluster from sysrq-t
output. The easiest way to collect that is to do this command:
echo "t" > /proc/sysrq-trigger
and capture/save the console output. Please be aware that triggering
the sysrq-t output can cause the system to hang and be unresponsive.
Therefore, other nodes may think the system is dead and try to fence
it. So you may have to temporarily disable fencing in order to
collect the complete output.
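A minimal sketch of collecting the three items above on each node, assuming the RHEL 4-era GFS/DLM tooling (the `gfs_tool lockdump` command, the `/proc/cluster/dlm_locks` interface, and the mount point and lockspace name shown are assumptions; the lockspace "cache" is taken from the log messages above, and all commands must be run as root):

```shell
# 1. Dump all GFS glocks for the GFS mount point (repeat per mount point;
#    /mnt/gfs is an example path, substitute your document root mount).
gfs_tool lockdump /mnt/gfs > /tmp/$(hostname)-glocks.txt

# 2. Dump all dlm locks for a lockspace; on RHEL 4 the DLM exposes a
#    /proc interface: write the lockspace name, then read the dump.
echo "cache" > /proc/cluster/dlm_locks
cat /proc/cluster/dlm_locks > /tmp/$(hostname)-dlm-locks.txt

# 3. Trigger sysrq-t; the call traces go to the console and kernel log,
#    so capture the serial/netconsole output or save dmesg immediately.
echo "t" > /proc/sysrq-trigger
dmesg > /tmp/$(hostname)-sysrq-t.txt
```

Remember that step 3 can stall the node long enough for other nodes to consider it dead, so disable fencing first as noted above.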
In NEEDINFO for six months; closing INSUFFICIENT_DATA.