Bug 238447

Summary: updatedb causing GFS filesystem to hang (RHEL 4 U3)
Product: [Retired] Red Hat Cluster Suite
Reporter: Sam Knuth <sfolkwil>
Component: gfs
Assignee: Abhijith Das <adas>
Status: CLOSED NOTABUG
QA Contact: GFS Bugs <gfs-bugs>
Severity: high
Priority: high
Version: 4
CC: hlawatschek, sfolkwil
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2008-11-11 21:28:34 UTC

Comment 1 Wendy Cheng 2007-05-01 05:48:50 UTC
Ran on RHEL4U4 with updatedb and bonnie at the same time. Bonnie keeps
failing with write errors. The following entries are logged in the kernel
messages file:

Apr 30 23:55:44 kanderso-xen-01 sshd(pam_unix)[7940]: session opened for user
root by root(uid=0)
Apr 30 23:57:09 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
35003aa state 0
May  1 00:01:41 kanderso-xen-01 sshd(pam_unix)[8036]: session opened for user
root by root(uid=0)
May  1 00:16:59 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
3dd016e state 0
May  1 00:32:53 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
46d01a8 state 0
May  1 00:39:11 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
47d007f state 0
May  1 00:39:27 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
43900df state 0
May  1 00:39:53 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
47f0349 state 0

Since the nodes do not have the debuginfo RPMs installed, it is hard to tell
what went wrong inside the kernel.

Will continue tomorrow.

Comment 2 Wendy Cheng 2007-05-01 20:38:18 UTC
Can't recreate the issue on my cluster nodes (RHEL4.5), so I think it would
be helpful to reinstall the nodes with RHEL4.3 (to match the customer's
environment). Unfortunately, after a full day of trying, the machines can't
take RHEL4.3. They are all Dell PCI-E machines - too new for RHEL4.3 - and
the kernel keeps panicking.

Comment 5 Wendy Cheng 2007-05-03 14:55:01 UTC
Since my nodes (PCI-E) can't take RHEL4.3, I moved an old workstation
into the lab. So I now have three nodes running (one on RHEL4.5, one on
RHEL4.4, one on RHEL4.3). Amazingly, they talked to each other without
trouble (well, except that I brought two racks down when I first joined the
workstation to the cluster, due to a power capacity issue). Overnight tests
also show no signs of trouble. Will keep running the test every night to
see how it goes. The workstation is running RHEL4.3.

In the meantime, here are some thoughts:

1. The updatedb run is mostly reads - so if the customer can mount the
   filesystem with "noatime", this could significantly reduce the system
   stress. We have been suspecting that RHEL4's DLM can't take too much
   stress (based on a conversation with dct in a previous bugzilla
   comment). It looks like the customer has not followed our "noatime"
   suggestion yet.
2. I still strongly suspect the customer hit the lock-ID wrap-around
   issue (fixed in RHEL4.5). I do have a simple systemtap program that
   can monitor this - unfortunately, it requires the debuginfo RPMs to be
   installed on the nodes in order to work. So encouraging them to move
   to RHEL4.5 would probably be the priority from our end.
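To illustrate suggestion 1, a GFS mount can be switched to noatime roughly as follows. The device and mount point below are hypothetical placeholders, not taken from the customer's setup:

```shell
# Remount an already-mounted GFS filesystem with noatime so that
# read-mostly scans like updatedb stop generating atime updates
# (each atime update is a write that must take a DLM lock).
mount -o remount,noatime /mnt/gfs1

# To make the change persistent, the /etc/fstab entry would look like:
# /dev/vg0/gfs1  /mnt/gfs1  gfs  noatime  0 0
```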



Comment 6 Wendy Cheng 2007-05-04 18:26:39 UTC
Still no sign of trouble from my overnight test runs. However, based on
the experiments done in the past few weeks, if I let the test run long
enough, it eventually hits bugzilla 199673 (lock-ID wrap-around). I really
think they should upgrade to RHEL4.5 (which has the lock-ID fix) if at all
possible.
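For illustration only (this is not the actual DLM allocator code): the wrap-around failure mode is that a 32-bit lock-ID counter wraps back through zero and re-issues an ID that an existing, still-held lock already owns. A minimal sketch of that collision:

```python
MASK = 0xFFFFFFFF          # lock IDs are 32-bit values

def allocate_ids(start, count):
    """Hand out `count` sequential IDs from a wrapping 32-bit counter."""
    ids = []
    counter = start
    for _ in range(count):
        counter = (counter + 1) & MASK   # wraps to 0 past 0xFFFFFFFF
        ids.append(counter)
    return ids

in_use = {5}                       # a long-lived lock still owns ID 5
fresh = allocate_ids(MASK - 1, 8)  # allocator is about to wrap

# The counter wraps through 0 and re-issues ID 5 while it is still held.
collisions = in_use.intersection(fresh)
print(sorted(collisions))          # [5]
```

Without a check for IDs that are still in use, the allocator hands two different locks the same ID, which is why long-running stress tests are the ones that trip it.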