Bug 222445 - Thousands of clurgmgrd threads when gfs exported thru nfs
Summary: Thousands of clurgmgrd threads when gfs exported thru nfs
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assignee: Lon Hohberger
QA Contact: Cluster QE
 
Reported: 2007-01-12 15:53 UTC by Robert Peterson
Modified: 2009-04-16 22:36 UTC (History)

Fixed In Version: RHBA-2007-0580
Doc Type: Bug Fix
Last Closed: 2007-11-07 16:45:27 UTC


Attachments
gdb backtrace (230.27 KB, application/octetstream)
2007-01-12 20:32 UTC, Lon Hohberger
Only let one status check thread exist. (1.10 KB, patch)
2007-02-01 17:10 UTC, Lon Hohberger


External Trackers
Tracker: Red Hat Product Errata
ID: RHBA-2007:0580
Priority: normal
Status: SHIPPED_LIVE
Summary: rgmanager bug fix and enhancement update
Last Updated: 2007-10-30 15:37:24 UTC

Description Robert Peterson 2007-01-12 15:53:41 UTC
Description of problem:
I was trying to test my fix for bz222299 on RHEL5, which involves
exporting a GFS file system through NFS, running the nfs_try test
case and simultaneously using a script to move the virtual IP from
node to node every 20 seconds or so.  On RHEL4, this worked just fine
except for an occasional NFS kernel panic (bz 221666).
I let the test run overnight.  When I checked it in the morning,
I noticed that "df" processes, one for each node, seemed to be
spinning at 100% CPU.  I tried gdb, but it wouldn't break in.
So I used magic sysrq-t to see what they were doing.  Much to my
surprise, I saw more than five thousand clurgmgrd threads:

[root@trin-10 ~]# ps -efL | grep clu | wc -l
5741
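
Note that "grep clu" matches every line containing "clu", which may include the
grep process itself and other cluster daemons, so the figure above is an upper
bound. A minimal sketch of a stricter count, assuming the ten-column layout
that procps prints for "ps -eLf" (UID PID PPID LWP C NLWP STIME TTY TIME CMD).
The function name and the sample text are invented for illustration and are
not output from the reported system:

```python
import os


def count_threads(ps_output, name):
    """Count rows of `ps -eLf` output whose executable is exactly `name`.

    Each row is one thread (LWP), so the result is a per-command thread count.
    """
    count = 0
    for line in ps_output.splitlines()[1:]:  # skip the header row
        fields = line.split(None, 9)         # 10th field is the full command line
        if len(fields) < 10:
            continue
        executable = fields[9].split()[0]
        if os.path.basename(executable) == name:
            count += 1
    return count


# Invented sample for illustration only.
SAMPLE = """\
UID        PID  PPID   LWP  C NLWP STIME TTY          TIME CMD
root      2101     1  2101  0    2 Jan11 ?        00:00:01 clurgmgrd
root      2101     1  2102  0    2 Jan11 ?        00:00:00 clurgmgrd
root      3000  2900  3000  0    1 15:53 pts/0    00:00:00 grep clu
"""

print(count_threads(SAMPLE, "clurgmgrd"))
```

On a live node the text could be obtained with
subprocess.run(["ps", "-eLf"], capture_output=True, text=True).stdout.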

This is probably just a side-effect of the df problem.
According to an IRC conversation with Lon:

<lon> #2  0x0805dcd0 in wait_for_dlm_event (ls=0x917f048) at lock.c:54
<lon> #3  0x0805dfd0 in clu_ls_unlock (ls=0x917f048, lksb=0xb7f17fbc) at lock.c:153
<lon> #4  0x0805e261 in clu_unlock (lksb=0xb7f17fbc) at lock.c:268
<lon> it sent an unlock but never got a response from DLM
<lon> (wait_for_dlm_event() is just select() on the dlm file descriptor)
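
To illustrate the hang Lon describes, here is a minimal sketch in Python (not
the rgmanager C code) of a select()-based wait on a file descriptor. A pipe
stands in for the DLM file descriptor, and the names wait_for_event and demo
are hypothetical. With timeout=None the call blocks until the descriptor
becomes readable, so a reply that never arrives leaves the caller waiting
forever; a bounded timeout lets the caller notice the missing reply instead:

```python
import os
import select


def wait_for_event(fd, timeout=None):
    """Return True if fd becomes readable, False if the timeout expires.

    timeout=None blocks indefinitely, which mirrors waiting on a reply
    that is never sent.
    """
    readable, _, _ = select.select([fd], [], [], timeout)
    return bool(readable)


def demo():
    # The pipe is a stand-in for the lock manager's descriptor (assumption).
    read_fd, write_fd = os.pipe()
    try:
        # Nothing has been written yet, so a bounded wait gives up.
        before = wait_for_event(read_fd, timeout=0.1)
        # Simulate the reply finally arriving.
        os.write(write_fd, b"x")
        after = wait_for_event(read_fd, timeout=0.1)
        return before, after
    finally:
        os.close(read_fd)
        os.close(write_fd)


print(demo())
```

This assumes a POSIX system, where select() accepts pipe descriptors. Whether
a timeout is appropriate for the real unlock path is a separate design
question that this sketch does not answer.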

Version-Release number of selected component (if applicable):
RHEL5 Beta 2

How reproducible:
Unknown--I suspect I can recreate it.

Steps to Reproduce:
Follow the same steps as seen in bz222299.
  
Actual results:
Thousands of clurgmgrd threads exist.

Expected results:
Only a few clurgmgrd threads should exist.

Additional info:
dmesg reported "do_vfs_lock: VFS is out of sync with lock manager!",
a message that comes from the NFS kernel code.

Comment 1 Lon Hohberger 2007-01-12 20:19:56 UTC
FYI, in this case, rgmanager never receives a response to an unlock request.

So the root cause of the errant behavior is unlikely to be fixable from
within rgmanager, but the symptom is still treatable.

Comment 2 Lon Hohberger 2007-01-12 20:32:20 UTC
Created attachment 145486
gdb backtrace

Comment 3 Lon Hohberger 2007-02-01 17:10:57 UTC
Created attachment 147119
Only let one status check thread exist.
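
The attached patch itself is not reproduced in this report. As a rough
illustration of the idea in its title, the following Python sketch (not the
rgmanager C code) refuses to start a new status check thread while a previous
one is still running, so a check that blocks cannot cause threads to pile up.
The class name StatusChecker and its methods are hypothetical:

```python
import threading


class StatusChecker:
    """Run a status check in a thread, allowing at most one at a time."""

    def __init__(self, check):
        self._check = check              # callable that performs the check
        self._busy = threading.Lock()    # held while a check thread is alive
        self.skipped = 0                 # triggers ignored because one was running

    def trigger(self):
        """Start a check thread unless one is still running.

        Returns the new thread, or None if the request was skipped.
        """
        if not self._busy.acquire(blocking=False):
            self.skipped += 1
            return None
        thread = threading.Thread(target=self._run, daemon=True)
        thread.start()
        return thread

    def _run(self):
        try:
            self._check()
        finally:
            # threading.Lock may be released by a thread other than the
            # one that acquired it, which this hand-off relies on.
            self._busy.release()
```

With this guard, a check that hangs on an unanswered unlock still ties up one
thread, but later periodic triggers are dropped rather than each adding
another blocked thread.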

Comment 4 Lon Hohberger 2007-02-21 20:49:56 UTC
Patches are in the RHEL5 and HEAD branches.

Comment 5 Kiersten (Kerri) Anderson 2007-04-23 17:26:35 UTC
Fixing Product Name.  Cluster Suite was integrated into Enterprise Linux for
version 5.0.

Comment 7 RHEL Product and Program Management 2007-05-01 17:36:04 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.

Comment 10 errata-xmlrpc 2007-11-07 16:45:27 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0580.html


