Bug 499734
Summary: | cluster goes down and ends up in OOM situation during umounts
---|---
Product: | Red Hat Enterprise Linux 5
Component: | openais
Status: | CLOSED DUPLICATE
Severity: | high
Priority: | high
Version: | 5.3
Target Milestone: | rc
Hardware: | All
OS: | Linux
Reporter: | Corey Marthaler <cmarthal>
Assignee: | Steven Dake <sdake>
QA Contact: | Cluster QE <mspqa-list>
CC: | cluster-maint, edamato, teigland
Keywords: | Regression
Doc Type: | Bug Fix
Last Closed: | 2009-05-23 09:34:05 UTC

Description
Corey Marthaler
2009-05-07 19:56:04 UTC
Created attachment 342933: log from grant-01
Created attachment 342934: log from grant-02
Created attachment 342936: log from grant-03

This is reproducible. I attempted to umount the same 50 GFS filesystems on each of the nodes, and as soon as I did, grant-01 was fenced and the remaining umount commands got stuck.

[root@grant-02 ~]# ps -elf | grep umount
0 S root 13270 10036 0 75 0 - 17939 wait 15:57 pts/0 00:00:00 umount /mnt/B1 /mnt/B10 /mnt/B11 /mnt/B12 /mnt/B13 /mnt/B14 /mnt/B15 /mnt/B16 /mnt/B17 /mnt/B18 /mnt/B19 /mnt/B2 /mnt/B20 /mnt/B21 /mnt/B22 /mnt/B23 /mnt/B24 /mnt/B25 /mnt/B3 /mnt/B4 /mnt/B5 /mnt/B6 /mnt/B7 /mnt/B8 /mnt/B9 /mnt/C1 /mnt/C10 /mnt/C11 /mnt/C12 /mnt/C13 /mnt/C14 /mnt/C15 /mnt/C16 /mnt/C17 /mnt/C18 /mnt/C19 /mnt/C2 /mnt/C20 /mnt/C21 /mnt/C22 /mnt/C23 /mnt/C24 /mnt/C25 /mnt/C3 /mnt/C4 /mnt/C5 /mnt/C6 /mnt/C7 /mnt/C8 /mnt/C9
4 D root 13380 13270 0 78 0 - 951 glock_ 15:57 pts/0 00:00:00 /sbin/umount.gfs /mnt/B15
0 S root 13670 13546 0 78 0 - 15289 pipe_w 16:09 pts/2 00:00:00 grep umount

It seems that openais is "going away" when it shouldn't. Unfortunately, we usually have to infer this from the effects it has on other things, since when openais goes away it generally disappears without a word.

Are there core dumps on the machines? Check /var/lib/openais.

There were three core dumps from May 7th on the systems: two on grant-01 and one on grant-02. I'll attach them.

Check that, the cores are too large to attach. Check them out on the machines listed above.

Core dumps are not on the machines; it looks like they have been reloaded.

Corey, can you retest this with the latest build of openais to see if you can reproduce the issue? Also, please capture the core files and backtraces and put them somewhere they won't be removed so sdake can examine them. Thanks!

This is basic stuff; things like this have worked without any problem for a long time. There are other recent bz's about serious openais regressions that have suddenly appeared. I think 5.3 openais was good, and things started to crumble in 5.3.z.

Dave, your opinion has no proposal for solving the problem.

Steven, eh?

Perry, with the latest openais/cman (openais-0.80.6-1.el5, built last week), I'm barely able to even mount a GFS filesystem, much less mount 25 and then attempt unmounts. Testing of this bug is blocked behind bug 501561. I too may not have a proposal for solving this problem, but things appear to have regressed a lot in the latest 5.4 cluster stuff.

Corey

This is likely a dupe of 501561. Can you retest once 501561 hits brew? Thanks

*** Bug 480709 has been marked as a duplicate of this bug. ***

Fix verified in openais-0.80.6-2.el5 / cman-2.0.103-1.el5.

This is just another symptom of the bug fixed in 501561. Marking as duplicate of that bug. Thanks for retesting though.

*** This bug has been marked as a duplicate of bug 501561 ***
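
For anyone retesting, the stuck-umount check from the description (looking for processes in state D in `ps -elf` output) could be scripted roughly as below. This is only a sketch and not part of the original report: the helper name `find_blocked` is made up for illustration, and it assumes the 15-column `ps -elf` layout (F S UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME CMD) seen in the output pasted above.

```python
def find_blocked(ps_lines):
    """Return (pid, wchan, cmd) for each process in uninterruptible sleep (state D).

    Assumes `ps -elf` style lines with 15 whitespace-separated columns,
    where the last column (CMD) may itself contain spaces.
    """
    blocked = []
    for line in ps_lines:
        # Split into at most 15 fields so the command line stays in one piece.
        fields = line.split(None, 14)
        if len(fields) < 15 or fields[0] == "F":
            continue  # skip short lines and the header row
        state, pid, wchan, cmd = fields[1], fields[3], fields[10], fields[14]
        if state.startswith("D"):
            blocked.append((int(pid), wchan, cmd))
    return blocked


# Lines taken from the description (the first command line is shortened here).
SAMPLE = [
    "0 S root 13270 10036 0 75 0 - 17939 wait 15:57 pts/0 00:00:00 umount /mnt/B1 /mnt/B10",
    "4 D root 13380 13270 0 78 0 - 951 glock_ 15:57 pts/0 00:00:00 /sbin/umount.gfs /mnt/B15",
    "0 S root 13670 13546 0 78 0 - 15289 pipe_w 16:09 pts/2 00:00:00 grep umount",
]

for pid, wchan, cmd in find_blocked(SAMPLE):
    print(f"pid {pid} blocked in {wchan}: {cmd}")
```

On the sample above this flags only the `/sbin/umount.gfs /mnt/B15` process, which is waiting in `glock_`. It shows which process is blocked, not why; the core files and backtraces requested earlier in this bug are still needed for that.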