Description of problem:
We've observed that dlm_recvd consumes a disproportionate share of CPU time on only one node of each cluster.
First 2-node cluster:
pro1a::> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
11740 Feb 19 00:00:00 clurgmgrd
11741 Feb 19 1-07:32:39 clurgmgrd
13915 Feb 19 1-02:22:35 [dlm_recvd]
20510 09:42:36 00:00:00 egrep clurgmgrd|dlm_recvd
pro2a::~> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
9207 Feb 25 00:09:00 clurgmgrd -d
11100 Feb 25 00:15:02 [dlm_recvd]
32249 09:42:39 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar 3 09:42:39 CET 2007
Second 2-node cluster:
pro1b::~> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
9310 Feb 14 00:00:00 clurgmgrd
9312 Feb 14 2-14:12:38 clurgmgrd
11589 Feb 14 2-05:39:19 [dlm_recvd]
26396 09:42:41 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar 3 09:42:41 CET 2007
pro2b::~> ps axo pid,start,time,args|egrep "clurgmgrd|dlm_recvd"
27767 Feb 22 00:00:00 clurgmgrd
27768 Feb 22 00:15:23 clurgmgrd
29929 Feb 22 00:01:59 [dlm_recvd]
6200 09:42:42 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar 3 09:42:42 CET 2007
Version-Release number of selected component (if applicable):
Steps to Reproduce:

Actual results:
Much higher load on node1 than on node2.

Expected results:
dlm_recvd should consume far less CPU time.
Please explain this disproportion in dlm_recvd CPU usage between node1 and node2
(on both clusters). In our configuration these nodes run a load-balanced
service, so the load should not differ much, yet we are observing a load of 10
on node1 and a load below 1 on node2.
We have also checked a non-production cluster (services enabled, but no data
being processed): node1 showed a load of 10 (max), while node2's load was ~0.5
(usually less).
We have 4 GFS and 12 ext3 filesystems mounted from SAN storage. The cluster
acts as a load-balanced smtp/pop3/smb/mysql(ndbd) server; balancing is done by
an external device. (Resource usage of the services is similar on both nodes.)
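To compare the two nodes, a quick way to pull out the accumulated CPU time of a process by name (as done with the `ps` one-liners above, but without the `egrep` self-match noise) is a small helper like the following. This is just an illustrative sketch; the function name `cpu_report` is our own, and it would be run on each node in turn:

```shell
#!/bin/sh
# Sketch: report cumulative CPU time ("TIME") next to wall-clock age
# ("ELAPSED") for every process with the given command name.
# Run on each node (e.g. over ssh) and compare the dlm_recvd lines.

cpu_report() {
    name=$1
    # "time" is cumulative CPU time consumed; "etime" is elapsed wall time.
    # Matching on the comm field avoids catching the grep/egrep itself.
    ps axo pid,comm,time,etime | awk -v n="$name" '$2 == n'
}

cpu_report dlm_recvd
```

A dlm_recvd whose TIME column is a sizable fraction of its ELAPSED column, as in the pro1a and pro1b output above, is the symptom being reported here.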
If you aren't using them yet, please install the ones here:
Or the ones from bug #212634.
OK, with that out of the way: there will always be more CPU time consumed by
dlm_recvd on a given node, simply due to the architecture of the DLM. However,
it shouldn't be a large amount. The amount you're seeing is probably related to
bug #212634: in some cases, rgmanager will (obviously incorrectly) leak DLM
locks. On the node mastering those locks, dlm_recvd then has to traverse an
ever-longer list of locks, ending up with more and more system time.
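Because the cost described here is list traversal in the kernel, it shows up as *system* time that keeps climbing even under a steady workload. A rough way to watch for that is to sample dlm_recvd's stime counter from /proc twice and look at the delta. This is only a sketch: the 10-second interval is an arbitrary choice, and the field-15 parse assumes the comm name contains no spaces (true for dlm_recvd):

```shell
#!/bin/sh
# Sketch: sample a process's kernel (system) time counter twice to see
# whether it climbs steadily, the symptom of dlm_recvd walking an
# ever-longer lock list.

stime_ticks() {
    # Field 15 of /proc/<pid>/stat is stime, in clock ticks. This simple
    # parse assumes the comm field (field 2) contains no spaces.
    awk '{print $15}' "/proc/$1/stat"
}

pid=$(pgrep -x dlm_recvd 2>/dev/null)
if [ -n "$pid" ]; then
    t1=$(stime_ticks "$pid")
    sleep 10
    t2=$(stime_ticks "$pid")
    echo "dlm_recvd system time grew by $((t2 - t1)) ticks in 10s"
else
    echo "dlm_recvd not running on this node"
fi
```

On a node with the leak described above, the per-interval delta would tend to grow over days as the lock list lengthens; on a healthy node it should stay small and roughly constant.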
*** This bug has been marked as a duplicate of 212634 ***
The packages in #228823 should also fix this.