Ran on RHEL4U4 with updatedb and bonnie at the same time. Bonnie keeps
failing with write errors. The following entries are logged:
Apr 30 23:55:44 kanderso-xen-01 sshd(pam_unix): session opened for user
root by root(uid=0)
Apr 30 23:57:09 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
35003aa state 0
May 1 00:01:41 kanderso-xen-01 sshd(pam_unix): session opened for user
root by root(uid=0)
May 1 00:16:59 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
3dd016e state 0
May 1 00:32:53 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
46d01a8 state 0
May 1 00:39:11 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
47d007f state 0
May 1 00:39:27 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
43900df state 0
May 1 00:39:53 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id
47f0349 state 0
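To track how often these messages show up across test runs, a small parser
along the following lines could pull the lockspace, lock id, and state out of
each entry. This is only a sketch: it assumes each syslog entry sits on one
line (the entries above are wrapped by the comment box), and the function and
field names are made up for illustration.

```python
import re
from collections import Counter

# Matches entries such as:
#   Apr 30 23:57:09 host kernel: dlm: gfs1: process_lockqueue_reply id 35003aa state 0
# Assumption: one syslog entry per line, lock id printed in hex without "0x".
PATTERN = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<host>\S+)\s+kernel:\s+"
    r"dlm:\s+(?P<lockspace>\S+):\s+process_lockqueue_reply\s+id\s+"
    r"(?P<lkid>[0-9a-fA-F]+)\s+state\s+(?P<state>\d+)"
)


def parse_lockqueue_replies(lines):
    """Return one dict per process_lockqueue_reply entry; skip other lines."""
    events = []
    for line in lines:
        match = PATTERN.match(line.strip())
        if match is None:
            continue  # e.g. sshd session messages
        events.append({
            "stamp": match.group("stamp"),
            "host": match.group("host"),
            "lockspace": match.group("lockspace"),
            "lkid": int(match.group("lkid"), 16),
            "state": int(match.group("state")),
        })
    return events


def count_per_lockspace(events):
    """Count parsed entries per lockspace name."""
    return Counter(event["lockspace"] for event in events)


# Two entries from the log above, joined back onto single lines, plus one
# unrelated sshd line that the parser should ignore.
SAMPLE = [
    "Apr 30 23:55:44 kanderso-xen-01 sshd(pam_unix): session opened for user root by root(uid=0)",
    "Apr 30 23:57:09 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id 35003aa state 0",
    "May 1 00:16:59 kanderso-xen-01 kernel: dlm: gfs1: process_lockqueue_reply id 3dd016e state 0",
]
```

Feeding the contents of the system log to parse_lockqueue_replies() and then
count_per_lockspace() gives a per-lockspace count that can be compared from
one overnight run to the next.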
Since the nodes do not have the debuginfo RPMs installed, it is hard to tell
what went wrong inside the kernel.
Will continue tomorrow.
Can't recreate the issue on my cluster nodes (RHEL4.5), so I think it would
be helpful to reinstall the nodes with RHEL4.3 (to match the customer's
environment). Unfortunately, after a full day of trying, the machines can't
take RHEL4.3. They are all Dell PCI-E machines - too new for RHEL4.3, and the
kernel keeps panicking.
Since my nodes (PCI-E) can't take RHEL4.3, I moved one old workstation
into the lab and installed RHEL4.3 on it. So I have three nodes running (one
on RHEL4.5, one on RHEL4.4, one on RHEL4.3). Amazingly, they talked to each
other without trouble (well, except that I brought two racks down when I first
joined the workstation into the cluster, due to a power capacity issue).
Overnight tests also show no sign of trouble. Will keep running the test
every night to see how it goes.
In the meantime, here are some thoughts:
1. updatedb is mostly reads - so if the customer can mount the fs with
   "noatime", this could significantly reduce the system stress. We have
   suspected that RHEL4's DLM can't take too much stress (based on a
   conversation with dct in our previous bugzilla comment). It looks like
   the customer has not followed our "noatime" suggestion yet.
2. I still strongly suspect the customer hit the lock id wrap-around
   issue (fixed in RHEL4.5). I do have a simple systemtap program that
   can monitor this - unfortunately, it requires the debuginfo RPMs to be
   installed on the nodes to be functional. So encouraging them to move
   to RHEL4.5 would probably be the priority from our end.
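To confirm whether the "noatime" suggestion in item 1 has been applied, the
mount table can be checked on each node. The sketch below assumes the table is
in the usual /proc/mounts layout (device, mount point, fs type, options, ...)
and that the filesystem type shows up as "gfs". The device names and mount
points in the sample are hypothetical, not taken from the customer's system.

```python
def mounts_missing_noatime(mounts_text, fstype="gfs"):
    """Return mount points of the given fs type that lack the noatime option.

    mounts_text is expected in /proc/mounts layout:
        device mount_point fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # blank or malformed line
        _device, mount_point, fs, options = fields[:4]
        if fs == fstype and "noatime" not in options.split(","):
            missing.append(mount_point)
    return missing


# Hypothetical mount table used only to exercise the function.
SAMPLE_MOUNTS = """\
/dev/vg0/lv_gfs1 /mnt/gfs1 gfs rw 0 0
/dev/vg0/lv_gfs2 /mnt/gfs2 gfs rw,noatime 0 0
/dev/sda1 / ext3 rw 0 0
"""
```

On a live node, passing the contents of /proc/mounts to
mounts_missing_noatime() would list any GFS mounts still updating atime.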
Still no sign of trouble from my overnight test runs. However, based on
the experiments done in the past few weeks, if I let the test run long enough,
it eventually hits bugzilla 199673 (lock id wrap-around). I really think
they should upgrade to RHEL4.5 (which has the lockid fix) if at all possible.
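As a rough illustration of why a wrap-around problem only shows up after a
long run, here is a toy model of an id allocator built on a fixed-width
counter. This is not the DLM code: the 16-bit width, the class, and the helper
are assumptions made purely for illustration, and the actual lock id layout is
not described in this report.

```python
class NaiveIdAllocator:
    """Toy allocator: ids come from a fixed-width counter, no in-use check."""

    def __init__(self, bits=16):
        # Assumption for illustration only: the counter is `bits` wide.
        self.mask = (1 << bits) - 1
        self.counter = 0

    def allocate(self):
        # The counter silently wraps back to 0 once it passes the mask.
        self.counter = (self.counter + 1) & self.mask
        return self.counter


def first_collision(allocator, long_lived_count, max_allocations):
    """Hold the first `long_lived_count` ids forever, free the rest at once.

    Returns (allocation_number, id) for the first id handed out while an
    identical id is still held, or None if no collision occurs in time.
    """
    held = set()
    for n in range(1, max_allocations + 1):
        new_id = allocator.allocate()
        if new_id in held:
            return n, new_id
        if len(held) < long_lived_count:
            held.add(new_id)
    return None
```

With a 16-bit counter and a single long-lived id, nothing goes wrong for the
first 65536 allocations; the duplicate only appears on allocation 65537. A
short test run never gets that far, which would be consistent with the issue
only appearing when the test is left running long enough.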