Bug 222445 - Thousands of clurgmgrd threads when gfs exported thru nfs
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: rgmanager
Hardware: All
OS: Linux
Priority: medium
Severity: medium
Assigned To: Lon Hohberger
QA Contact: Cluster QE
Depends On:
Reported: 2007-01-12 10:53 EST by Robert Peterson
Modified: 2009-04-16 18:36 EDT
1 user

Fixed In Version: RHBA-2007-0580
Doc Type: Bug Fix
Last Closed: 2007-11-07 11:45:27 EST

Attachments
gdb backtrace (230.27 KB, application/octet-stream)
2007-01-12 15:32 EST, Lon Hohberger
Only let one status check thread exist. (1.10 KB, patch)
2007-02-01 12:10 EST, Lon Hohberger

Description Robert Peterson 2007-01-12 10:53:41 EST
Description of problem:
I was trying to test my fix for bz222299 on RHEL5, which involves
exporting a GFS file system through NFS, running the nfs_try test
case, and simultaneously using a script to move the virtual IP from
node to node every 20 seconds or so.  On RHEL4 this worked fine,
except for an occasional NFS kernel panic (bz221666).
I let the test run overnight.  When I checked it in the morning,
I noticed that the "df" processes, one per node, appeared to be
spinning at 100% CPU.  I tried gdb, but it could not break in,
so I used magic sysrq-t to see what they were doing.  Much to my
surprise, I saw more than five thousand clurgmgrd threads:

[root@trin-10 ~]# ps -efL | grep clu | wc -l

This is probably just a side-effect of the df problem.
According to an IRC conversation with Lon:

<lon> #2  0x0805dcd0 in wait_for_dlm_event (ls=0x917f048) at lock.c:54
<lon> #3  0x0805dfd0 in clu_ls_unlock (ls=0x917f048, lksb=0xb7f17fbc) at lock.c:153
<lon> #4  0x0805e261 in clu_unlock (lksb=0xb7f17fbc) at lock.c:268
<lon> it sent an unlock but never got a response from DLM
<lon> (wait_for_dlm_event() is just select() on the dlm file descriptor)

Version-Release number of selected component (if applicable):
RHEL5 Beta 2

How reproducible:
Unknown; I suspect I can recreate it.

Steps to Reproduce:
Follow the same steps as seen in bz222299.
Actual results:
Thousands of clurgmgrd threads exist.

Expected results:
Only a few clurgmgrd threads should exist.

Additional info:
dmesg said: do_vfs_lock: VFS is out of sync with lock manager!
which is a message that comes out of the NFS kernel code.
Comment 1 Lon Hohberger 2007-01-12 15:19:56 EST
FYI, in this case, rgmanager never receives a response to an unlock request.  

So the root cause of the errant behavior is unlikely to be fixable from
within rgmanager, but the symptom is still treatable.
Comment 2 Lon Hohberger 2007-01-12 15:32:20 EST
Created attachment 145486 [details]
gdb backtrace
Comment 3 Lon Hohberger 2007-02-01 12:10:57 EST
Created attachment 147119 [details]
Only let one status check thread exist.
Comment 4 Lon Hohberger 2007-02-21 15:49:56 EST
Patches committed to the RHEL5 and HEAD branches.
Comment 5 Kiersten (Kerri) Anderson 2007-04-23 13:26:35 EDT
Fixing product name.  Cluster Suite was integrated into Red Hat
Enterprise Linux as of version 5.0.
Comment 7 RHEL Product and Program Management 2007-05-01 13:36:04 EDT
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux maintenance release.  Product Management has requested
further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed
products.  This request is not yet committed for inclusion in an Update
release.
Comment 10 errata-xmlrpc 2007-11-07 11:45:27 EST
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

