Bug 238447 - updatedb causing GFS filesystem to hang (RHEL 4 U3)
Status: CLOSED NOTABUG
Product: Red Hat Cluster Suite
Classification: Red Hat
Component: gfs
Version: 4
Hardware: All
OS: Linux
Priority: high
Severity: high
Assigned To: Abhijith Das
QA Contact: GFS Bugs
Reported: 2007-04-30 11:48 EDT by Sam Folk-Williams
Modified: 2010-01-11 22:15 EST

Doc Type: Bug Fix
Last Closed: 2008-11-11 16:28:34 EST

Attachments: None
Comment 1 Wendy Cheng 2007-05-01 01:48:50 EDT
Ran the test on RHEL4U4 with updatedb and bonnie at the same time. Bonnie
keeps failing with write errors. The following entries are logged in the
kernel messages file:

Apr 30 23:55:44 kanderso-xen-01 sshd(pam_unix)[7940]: session opened for user
root by root(uid=0)
Apr 30 23:57:09 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
35003aa state 0
May  1 00:01:41 kanderso-xen-01 sshd(pam_unix)[8036]: session opened for user
root by root(uid=0)
May  1 00:16:59 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
3dd016e state 0
May  1 00:32:53 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
46d01a8 state 0
May  1 00:39:11 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
47d007f state 0
May  1 00:39:27 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
43900df state 0
May  1 00:39:53 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
47f0349 state 0

Since the nodes do not have the debuginfo RPM installed, it is hard to tell
what went wrong inside the kernel.

Will continue tomorrow.
Comment 2 Wendy Cheng 2007-05-01 16:38:18 EDT
Can't recreate the issue on my cluster nodes (RHEL4.5), so I think it would
be helpful to reinstall the nodes with RHEL4.3 (to match the customer's
environment). Unfortunately, after a full day of trying, the machines can't
take RHEL4.3. They are all Dell PCI-E machines - too new for RHEL4.3 - and
the kernel keeps panicking.
Comment 5 Wendy Cheng 2007-05-03 10:55:01 EDT
Since my nodes (PCI-E) can't take RHEL4.3, I moved one old workstation
into the lab, so I now have three nodes running (one on RHEL4.5, one on
RHEL4.4, one on RHEL4.3). Amazingly, they talked to each other without
trouble (well, except that I brought two racks down due to a power capacity
issue when I first joined the workstation to the cluster). Overnight tests
also show no signs of trouble. Will keep running the test every night to
see how it goes.

In the meantime, here are some thoughts:

1. updatedb is mostly reads - so if the customer can mount the fs with
   "noatime", this could significantly reduce the system stress. We have
   been suspecting that RHEL4's DLM can't take too much stress (based on
   a conversation with dct in a previous bugzilla comment). It looks like
   the customer has not followed our "noatime" suggestion yet.
2. I still strongly suspect the customer hit the lock id wrap-around
   issue (fixed in RHEL4.5). I do have a simple systemtap program that
   can monitor this - unfortunately, it requires the debuginfo RPMs to be
   installed on the nodes before it will work. So encouraging them to
   move to RHEL4.5 would probably be the priority from our end.

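The "noatime" suggestion in item 1 amounts to a mount option change. A
sketch follows; the device path and mount point (/dev/vg0/lv_gfs1,
/mnt/gfs1) are hypothetical placeholders, and this is a config fragment
rather than something to run verbatim:

```shell
# /etc/fstab entry with atime updates disabled (hypothetical device/path):
# /dev/vg0/lv_gfs1  /mnt/gfs1  gfs  noatime  0 0

# Or remount a live GFS filesystem without atime updates:
mount -o remount,noatime /mnt/gfs1

# Confirm the option took effect:
mount | grep /mnt/gfs1
```

With noatime, the read-heavy updatedb scan no longer turns every inode
access into a metadata write, which is what drives the extra DLM lock
traffic across the cluster.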
Comment 6 Wendy Cheng 2007-05-04 14:26:39 EDT
Still no sign of trouble from my overnight test runs. However, based on
the experiments done in the past few weeks, if I let the test run long
enough, it eventually hits bugzilla 199673 (lock id wrap-around). I really
think they should upgrade to RHEL4.5 (which has the lockid fix) if at all
possible.
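The wrap-around failure mode referenced above (bugzilla 199673) can be
illustrated with a toy model - this is not the actual DLM code, and the
class and names below are made up purely for illustration. The point is
that a 32-bit lock id counter that wraps can hand out an id still held by
a live lock, so later lookups by id resolve to the wrong lock:

```python
class LockIdAllocator:
    """Toy model of a 32-bit lock id counter (not the real DLM allocator)."""
    MASK = 0xFFFFFFFF  # ids live in a 32-bit space

    def __init__(self):
        self.next_id = 0
        self.in_use = set()  # ids of locks that are still held

    def alloc(self):
        """Hand out the next id; report whether it collides with a held lock."""
        lock_id = self.next_id
        self.next_id = (self.next_id + 1) & self.MASK  # wraps at 2**32
        collision = lock_id in self.in_use
        self.in_use.add(lock_id)
        return lock_id, collision

alloc = LockIdAllocator()
first_id, _ = alloc.alloc()          # id 0, and the lock stays held
alloc.next_id = 0xFFFFFFFF           # simulate a long, lock-heavy workload
alloc.alloc()                        # hands out id 0xFFFFFFFF
reused_id, collided = alloc.alloc()  # counter wraps back to 0
print(reused_id == first_id, collided)  # -> True True
```

A sustained metadata scan like updatedb over a large GFS tree is exactly
the kind of workload that burns through lock ids, which is why the RHEL4.5
lockid fix is the recommended remedy here.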
