Bug 238447
Summary: | updatedb causing GFS filesystem to hang (RHEL 4 U3) | ||
---|---|---|---|
Product: | [Retired] Red Hat Cluster Suite | Reporter: | Sam Knuth <sfolkwil> |
Component: | gfs | Assignee: | Abhijith Das <adas> |
Status: | CLOSED NOTABUG | QA Contact: | GFS Bugs <gfs-bugs> |
Severity: | high | Docs Contact: | |
Priority: | high | ||
Version: | 4 | CC: | hlawatschek, sfolkwil |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | | Doc Type: | Bug Fix |
Doc Text: | | Story Points: | --- |
Clone Of: | | Environment: | |
Last Closed: | 2008-11-11 21:28:34 UTC | Type: | --- |
Regression: | --- | Mount Type: | --- |
Documentation: | --- | CRM: | |
Verified Versions: | | Category: | --- |
oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |
Cloudforms Team: | --- | Target Upstream Version: | |
Embargoed: |
Comment 1
Wendy Cheng
2007-05-01 05:48:50 UTC
I can't recreate the issue on my cluster nodes (RHEL 4.5), so I think it would help to reinstall the nodes with RHEL 4.3 to match the customer's environment. Unfortunately, after a full day of trying, the machines can't take RHEL 4.3: they are all Dell PCI-E machines, too new for RHEL 4.3, and the kernel keeps panicking.

Since my PCI-E nodes can't run RHEL 4.3, I moved an old workstation into the lab and installed RHEL 4.3 on it. So I now have three nodes running: one on RHEL 4.5, one on RHEL 4.4, and one on RHEL 4.3. Amazingly, they talk to each other without trouble (well, except that I brought two racks down due to a power capacity issue when I first joined the workstation to the cluster). Overnight tests also show no signs of trouble. I will keep running the test every night to see how it goes.

In the meantime, here are some thoughts:

1. updatedb mostly reads, and without "noatime" every read also updates the inode's access time. So if the customer can mount the filesystem with "noatime", this could significantly reduce the system stress. We have been suspecting that RHEL 4's DLM can't take too much stress (based on the conversation with dct in our previous bugzilla comment). It looks like the customer has not followed our "noatime" suggestion yet.
2. I still strongly suspect the customer hit the lock id wrap-around issue (fixed in RHEL 4.5). I do have a simple systemtap program that can monitor this; unfortunately, it requires the debuginfo RPMs to be installed on the nodes to work. So encouraging them to move to RHEL 4.5 would probably be the priority from our end.

Still no sign of trouble from my overnight test runs. However, based on the experiments done in the past few weeks, whenever I let the test run long enough, it eventually hit bugzilla 199673 (lock id wrap-around). I really think they should upgrade to RHEL 4.5 (which has the lock id fix) if at all possible.
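Whether the "noatime" suggestion has actually been applied can be checked on each node. Below is a minimal Python sketch. It assumes the standard /proc/mounts line layout (device, mount point, filesystem type, options, dump, pass) and that GFS mounts report the type "gfs". The function name and the sample mount lines are hypothetical.

```python
def gfs_mounts_without_noatime(mounts_text):
    """Return mount points of GFS filesystems whose options lack "noatime".

    Assumes the /proc/mounts line layout:
        device mountpoint fstype options dump pass
    """
    missing = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or truncated lines
        mountpoint, fstype, options = fields[1], fields[2], fields[3]
        # Options are a comma-separated list, e.g. "rw,noatime"
        if fstype == "gfs" and "noatime" not in options.split(","):
            missing.append(mountpoint)
    return missing


# Hypothetical sample in /proc/mounts format; on a real node pass
# open("/proc/mounts").read() instead.
sample = (
    "/dev/sda1 / ext3 rw 0 0\n"
    "/dev/vg0/gfs1 /mnt/gfs1 gfs rw 0 0\n"
    "/dev/vg0/gfs2 /mnt/gfs2 gfs rw,noatime 0 0\n"
)
print(gfs_mounts_without_noatime(sample))  # ['/mnt/gfs1']
```

An empty list on every node would indicate that all GFS mounts already carry "noatime".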
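To make the lock id wrap-around concrete, here is a toy Python model. A fixed-width id counter wraps around while one long-lived lock still holds an early id. This is not the DLM's code, and the source does not describe the mechanism of bugzilla 199673. The class, the counter widths, and the rule of skipping id 0 are all hypothetical. The model only shows why a wrapping counter can hand out an id that is still in use, and why that takes many allocations to show up.

```python
class WrappingIdAllocator:
    """Toy model (hypothetical): ids come from a counter `bits` wide that
    wraps back to 1 on overflow; id 0 is never handed out."""

    def __init__(self, bits):
        self.limit = 1 << bits  # counter values 0 .. limit - 1
        self.next_id = 1
        self.in_use = set()

    def allocate(self):
        """Return (lock_id, collided); collided is True if the id is still held."""
        lock_id = self.next_id
        self.next_id += 1
        if self.next_id == self.limit:  # overflow: wrap around, skipping 0
            self.next_id = 1
        collided = lock_id in self.in_use
        self.in_use.add(lock_id)
        return lock_id, collided

    def release(self, lock_id):
        self.in_use.discard(lock_id)


def first_collision(bits):
    """Hold one lock forever, churn short-lived locks, and report
    (allocation count, id) at the first reuse of a still-held id."""
    alloc = WrappingIdAllocator(bits)
    alloc.allocate()  # long-lived lock: keeps id 1 for the whole run
    count = 1
    while True:
        lock_id, collided = alloc.allocate()
        count += 1
        if collided:
            return count, lock_id
        alloc.release(lock_id)  # short-lived lock, released immediately


print(first_collision(3))  # (8, 1)
```

With a 3-bit counter, the collision appears on the 8th allocation. Each extra bit doubles that count, which is consistent with the observation that the test has to run for a long time before it hits the wrap-around.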