Bug 230830

Summary: 2Node cluster - dlm_recvd consumes resources, one node has bigger load than another
Product: [Retired] Red Hat Cluster Suite
Reporter: Tomasz Jaszowski <tjaszowski>
Component: rgmanager
Assignee: Lon Hohberger <lhh>
Status: CLOSED DUPLICATE
QA Contact: Cluster QE <mspqa-list>
Severity: high
Priority: medium
Version: 4
CC: cluster-maint
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2007-03-05 16:40:07 UTC

Description Tomasz Jaszowski 2007-03-03 09:52:28 UTC
Description of problem:

We've observed that dlm_recvd consumes a lot of CPU time on only one node of each cluster.

First 2node cluster
pro1a::> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
11740   Feb 19 00:00:00 clurgmgrd
11741   Feb 19 1-07:32:39 clurgmgrd
13915   Feb 19 1-02:22:35 [dlm_recvd]
20510 09:42:36 00:00:00 egrep clurgmgrd|dlm_recvd

pro2a::~> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
 9207   Feb 25 00:09:00 clurgmgrd -d
11100   Feb 25 00:15:02 [dlm_recvd]
32249 09:42:39 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar  3 09:42:39 CET 2007


Second 2node cluster
pro1b::~> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
 9310   Feb 14 00:00:00 clurgmgrd
 9312   Feb 14 2-14:12:38 clurgmgrd
11589   Feb 14 2-05:39:19 [dlm_recvd]
26396 09:42:41 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar  3 09:42:41 CET 2007

pro2b::~> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
27767   Feb 22 00:00:00 clurgmgrd
27768   Feb 22 00:15:23 clurgmgrd
29929   Feb 22 00:01:59 [dlm_recvd]
 6200 09:42:42 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar  3 09:42:42 CET 2007
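
To compare the figures above, the cumulative TIME column that ps prints in [DD-]HH:MM:SS form can be converted to seconds. The following is a minimal sketch, not part of the original report; the helper name cputime_to_seconds is hypothetical, and the sample values are the dlm_recvd TIME fields copied from the listings above.

```python
def cputime_to_seconds(value: str) -> int:
    """Convert a ps TIME value in [DD-]HH:MM:SS form to seconds."""
    days = 0
    if "-" in value:
        # An optional leading "DD-" carries whole days of CPU time.
        day_part, value = value.split("-", 1)
        days = int(day_part)
    hours, minutes, seconds = (int(part) for part in value.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


# dlm_recvd TIME values copied from the ps listings in this report.
samples = {
    "pro1a": "1-02:22:35",
    "pro2a": "00:15:02",
    "pro1b": "2-05:39:19",
    "pro2b": "00:01:59",
}

cpu_seconds = {node: cputime_to_seconds(t) for node, t in samples.items()}

for node, secs in cpu_seconds.items():
    print(f"{node}: {secs} s of CPU time in dlm_recvd")
```

Note that the processes were started on different dates (Feb 19 vs. Feb 25, Feb 14 vs. Feb 22), so the raw totals are not directly comparable without dividing by each process's elapsed run time.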


Version-Release number of selected component (if applicable):
RH4U4

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.
  
Actual results:
Much higher load on node1 than on node2

Expected results:
dlm_recvd should consume less CPU time.

Please explain this disproportion in dlm_recvd CPU usage between node1 and node2.

Additional info:
(Regarding both clusters) In our configuration these nodes run as a
load-balanced service, so the load should not differ too much, but we are
observing load 10 on node1 and load <1 on node2.

We have also checked on a non-production cluster (services enabled, but no data
being processed): node1 showed load 10 (max), while node2 had load ~0.5
(usually less).

We have 4 GFS and 12 ext3 filesystems mounted from SAN storage. This cluster
acts as a load-balanced smtp/pop3/smb/mysql(ndbd) server. Balancing is done by
an external device. (Resource usage of the services is similar on both nodes.)

Comment 1 Lon Hohberger 2007-03-05 16:40:07 UTC
If you aren't using them yet, please install the packages from here:

http://people.redhat.com/lhh/packages.html

Or the ones from bug #212634.

OK, with that out of the way: there will always be more CPU time consumed by
dlm_recvd on one given node, simply due to the architecture of the DLM.
However, it shouldn't be a large amount. The amount you're seeing is probably
related to bug #212634: in some cases, rgmanager will (obviously incorrectly)
leak DLM locks. On the node mastering the locks, this causes dlm_recvd to
traverse an ever longer list of locks, so more and more system time is used.
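
As a rough illustration of why leaked locks inflate CPU time on the mastering node, the toy model below counts list entries visited per lookup. It is only a sketch under simplifying assumptions: the plain Python list, the linear scan, and the names scan_cost and simulate are all hypothetical and do not reflect the DLM's actual data structures or code.

```python
def scan_cost(lock_list, wanted):
    """Count the entries visited while searching lock_list for `wanted`."""
    visited = 0
    for lock_id in lock_list:
        visited += 1
        if lock_id == wanted:
            break
    return visited


def simulate(requests, leak):
    """Total entries visited over `requests` lock/lookup cycles.

    With leak=True the lock is never removed after use, which models
    (very loosely) a client that forgets to release its locks.
    """
    locks = []
    total = 0
    for lock_id in range(requests):
        locks.append(lock_id)                # lock is granted and recorded
        total += scan_cost(locks, lock_id)   # later lookup walks the list
        if not leak:
            locks.remove(lock_id)            # well-behaved client releases it
    return total


print("no leak:", simulate(1000, leak=False))
print("leak:   ", simulate(1000, leak=True))
```

Without the leak the work grows linearly with the number of requests; with the leak each lookup walks a list that keeps growing, so the total work grows quadratically in this model.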

*** This bug has been marked as a duplicate of 212634 ***

Comment 2 Lon Hohberger 2007-03-05 16:43:58 UTC
The packages in #228823 should also fix this.