Bug 499734 - cluster goes down and ends up in OOM situation during umounts
Status: CLOSED DUPLICATE of bug 501561
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: openais
Version: 5.3
Hardware: All Linux
Priority: high Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Steven Dake
QA Contact: Cluster QE
Keywords: Regression
Duplicates: 480709
Depends On:
Blocks:
Reported: 2009-05-07 15:56 EDT by Corey Marthaler
Modified: 2016-04-26 12:03 EDT
CC List: 3 users

See Also:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-05-23 05:34:05 EDT

Attachments
log from grant-01 (136.37 KB, text/plain)
2009-05-07 16:17 EDT, Corey Marthaler
log from grant-02 (46.24 KB, text/plain)
2009-05-07 16:18 EDT, Corey Marthaler
log from grant-03 (34.04 KB, text/plain)
2009-05-07 16:21 EDT, Corey Marthaler
Description Corey Marthaler 2009-05-07 15:56:04 EDT
Description of problem:
I was attempting to reproduce bug 480709. I had 25 GFS filesystems mounted on each node of the 3-node grant cluster, and then attempted to umount all of them. Doing so caused the cluster to go crazy, and two of the nodes even died due to OOM conditions.

This appears to be a regression; I don't recall seeing anything like this in the past year or so.

Version-Release number of selected component (if applicable):
2.6.18-138.el5
cman-2.0.98-1.el5_3.1
openais-0.80.3-22.el5_3.4

I'll attach the logs leading up to the failures.
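
The unmount step of the reproduction can be sketched roughly as follows. This is only an illustration, not the reporter's actual test harness: the /mnt/B1 to /mnt/C25 mount point names are taken from the umount command line shown in comment 4, while the function names and the DRY_RUN switch are hypothetical additions so the command can be inspected without touching a live cluster.

```shell
#!/bin/bash
# Sketch of the "unmount everything at once" step; not the reporter's harness.
# Mount point names follow the umount command line shown in comment 4.

# Print /mnt/B1../mnt/B25 and /mnt/C1../mnt/C25, one per line.
list_mounts() {
    local prefix i
    for prefix in B C; do
        for i in $(seq 1 25); do
            echo "/mnt/${prefix}${i}"
        done
    done
}

# Unmount all mount points in a single umount invocation, as in the report.
# DRY_RUN is a hypothetical switch (default 1): only print the command.
umount_all() {
    local mounts
    mounts=$(list_mounts | tr '\n' ' ')
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "umount ${mounts% }"
    else
        # Word splitting of $mounts is intentional here.
        umount $mounts
    fi
}
```

Running `umount_all` with the default settings only prints the command; setting `DRY_RUN=0` on each node would actually issue it.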
Comment 1 Corey Marthaler 2009-05-07 16:17:50 EDT
Created attachment 342933
log from grant-01
Comment 2 Corey Marthaler 2009-05-07 16:18:25 EDT
Created attachment 342934
log from grant-02
Comment 3 Corey Marthaler 2009-05-07 16:21:35 EDT
Created attachment 342936
log from grant-03
Comment 4 Corey Marthaler 2009-05-07 17:10:48 EDT
This is reproducible. I attempted to umount the same 50 GFS filesystems on each of the nodes, and as soon as I attempted it, grant-01 was fenced and the remaining umount commands got stuck.

[root@grant-02 ~]# ps -elf | grep umount
0 S root     13270 10036  0  75   0 - 17939 wait   15:57 pts/0    00:00:00 umount /mnt/B1 /mnt/B10 /mnt/B11 /mnt/B12 /mnt/B13 /mnt/B14 /mnt/B15 /mnt/B16 /mnt/B17 /mnt/B18 /mnt/B19 /mnt/B2 /mnt/B20 /mnt/B21 /mnt/B22 /mnt/B23 /mnt/B24 /mnt/B25 /mnt/B3 /mnt/B4 /mnt/B5 /mnt/B6 /mnt/B7 /mnt/B8 /mnt/B9 /mnt/C1 /mnt/C10 /mnt/C11 /mnt/C12 /mnt/C13 /mnt/C14 /mnt/C15 /mnt/C16 /mnt/C17 /mnt/C18 /mnt/C19 /mnt/C2 /mnt/C20 /mnt/C21 /mnt/C22 /mnt/C23 /mnt/C24 /mnt/C25 /mnt/C3 /mnt/C4 /mnt/C5 /mnt/C6 /mnt/C7 /mnt/C8 /mnt/C9
4 D root     13380 13270  0  78   0 -   951 glock_ 15:57 pts/0    00:00:00 /sbin/umount.gfs /mnt/B15
0 S root     13670 13546  0  78   0 - 15289 pipe_w 16:09 pts/2    00:00:00 grep umount
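
In the listing above, /sbin/umount.gfs is in state D (uninterruptible sleep) with a wait channel beginning with glock_. A small sketch for spotting such stuck processes on a node is shown below. The function names are hypothetical, and it assumes a procps-style ps that accepts the `-eo pid,stat,wchan:32,args` output format.

```shell
#!/bin/bash
# Sketch: show processes in uninterruptible sleep (state D) and what they wait on.

# Read "PID STAT WCHAN COMMAND..." lines on stdin and keep only those whose
# STAT field starts with D. The ps header line is dropped as a side effect,
# because its second field is "STAT".
filter_dstate() {
    awk '$2 ~ /^D/'
}

# List D-state processes on the local node (assumes procps ps).
list_dstate() {
    ps -eo pid,stat,wchan:32,args | filter_dstate
}
```

Calling `list_dstate` on each node while the umounts hang would show which processes are blocked and on which wait channel.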
Comment 5 David Teigland 2009-05-07 17:32:50 EDT
It seems that openais is "going away" when it shouldn't.  Unfortunately, we usually have to infer this from the effects that has on other things, since when openais goes away it generally disappears without a word.
Comment 6 Steven Dake 2009-05-13 07:06:50 EDT
Are there core dumps on the machines?

/var/lib/openais
Comment 7 Corey Marthaler 2009-05-13 11:45:09 EDT
There were three core dumps from May 7th on the systems, 2 on grant-01 and 1 on grant-02. I'll attach them.
Comment 8 Corey Marthaler 2009-05-13 11:51:53 EDT
Check that, the cores are too large to attach. Check them out on the machines listed above.
Comment 9 Steven Dake 2009-05-18 18:09:04 EDT
The core dumps are not on the machines; it looks like the machines have been reloaded.
Comment 10 Perry Myers 2009-05-20 09:32:46 EDT
Corey, can you retest this with the latest build of openais to see if you can reproduce the issue?  Also, please capture the core files and backtraces and put them somewhere they won't be removed so sdake can examine them.  Thanks!
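
One possible way to preserve the cores and backtraces is sketched below. Only the /var/lib/openais directory comes from this report (comment 6). The core* file name pattern, the /usr/sbin/aisexec binary path, the save directory, and the function name are all assumptions made for illustration and should be adjusted to the actual systems.

```shell
#!/bin/bash
# Sketch: copy core files somewhere safe and save a backtrace next to each copy.
# CORE_DIR is from comment 6; SAVE_DIR and AISEXEC are assumptions.
CORE_DIR=${CORE_DIR:-/var/lib/openais}
SAVE_DIR=${SAVE_DIR:-/root/bz499734-cores}   # hypothetical location
AISEXEC=${AISEXEC:-/usr/sbin/aisexec}        # assumed path of the openais executive

save_cores() {
    local core copy
    mkdir -p "$SAVE_DIR" || return 1
    # Assumes core files are named core* inside CORE_DIR.
    for core in "$CORE_DIR"/core*; do
        [ -f "$core" ] || continue           # glob matched nothing
        # Prefix with the node name so cores from different nodes do not collide.
        copy="$SAVE_DIR/$(uname -n)-$(basename "$core")"
        cp -p "$core" "$copy" || return 1
        # The backtrace step is skipped if gdb or the binary is unavailable.
        if command -v gdb >/dev/null 2>&1 && [ -x "$AISEXEC" ]; then
            gdb -batch -ex "thread apply all bt" "$AISEXEC" "$copy" > "$copy.bt" 2>&1
        fi
        echo "$copy"
    done
}
```

Running `save_cores` on each node prints the path of every preserved core; pointing SAVE_DIR at storage that survives a reinstall would keep the files available for later analysis.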
Comment 12 David Teigland 2009-05-20 09:54:35 EDT
This is basic stuff; things like this have worked without any problem for a long time.  There are other recent bz's about serious openais regressions that have suddenly appeared.  I think 5.3 openais was good, and things started to crumble in 5.3.z.
Comment 13 Steven Dake 2009-05-20 10:50:15 EDT
Dave,

Your opinion has no proposal for solving the problem.
Comment 14 David Teigland 2009-05-20 10:59:18 EDT
Steven, eh?
Comment 15 Corey Marthaler 2009-05-20 11:05:06 EDT
Perry, with the latest openais/cman (openais-0.80.6-1.el5 built last week), I'm
barely able to even mount a GFS filesystem, much less mount 25 and then attempt
unmounts. Testing of this bug is blocked behind bug 501561.

I too may not have a proposal for solving this problem, but things appear to
have regressed a lot in the latest 5.4 cluster stuff.
Comment 16 Steven Dake 2009-05-21 10:17:30 EDT
Corey

This is likely a dupe of 501561.  Can you retest once 501561 hits brew?

Thanks
Comment 17 Steven Dake 2009-05-21 10:22:21 EDT
*** Bug 480709 has been marked as a duplicate of this bug. ***
Comment 18 Corey Marthaler 2009-05-22 16:04:30 EDT
Fix verified in openais-0.80.6-2.el5 / cman-2.0.103-1.el5.
Comment 20 Steven Dake 2009-05-23 05:34:05 EDT
This is just another symptom of the bug fixed in 501561.  Marking as duplicate of that bug.  Thanks for retesting though.

*** This bug has been marked as a duplicate of bug 501561 ***
