Description of problem:
We've observed that dlm_recvd consumes a large share of CPU time on only one
node of each cluster.

First 2-node cluster:

pro1a::> ps axo pid,start,time,args | egrep "clurgmgrd|dlm_recvd"
11740 Feb 19 00:00:00 clurgmgrd
11741 Feb 19 1-07:32:39 clurgmgrd
13915 Feb 19 1-02:22:35 [dlm_recvd]
20510 09:42:36 00:00:00 egrep clurgmgrd|dlm_recvd

pro2a::~> ps axo pid,start,time,args | egrep "clurgmgrd|dlm_recvd"
9207 Feb 25 00:09:00 clurgmgrd -d
11100 Feb 25 00:15:02 [dlm_recvd]
32249 09:42:39 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar 3 09:42:39 CET 2007

Second 2-node cluster:

pro1b::~> ps axo pid,start,time,args | egrep "clurgmgrd|dlm_recvd"
9310 Feb 14 00:00:00 clurgmgrd
9312 Feb 14 2-14:12:38 clurgmgrd
11589 Feb 14 2-05:39:19 [dlm_recvd]
26396 09:42:41 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar 3 09:42:41 CET 2007

pro2b::~> ps axo pid,start,time,args | egrep "clurgmgrd|dlm_recvd"
27767 Feb 22 00:00:00 clurgmgrd
27768 Feb 22 00:15:23 clurgmgrd
29929 Feb 22 00:01:59 [dlm_recvd]
6200 09:42:42 00:00:00 egrep clurgmgrd|dlm_recvd
Sat Mar 3 09:42:42 CET 2007

Version-Release number of selected component (if applicable):
RHEL 4 Update 4 (RH4U4)

How reproducible:
100%

Steps to Reproduce:
1.
2.
3.

Actual results:
Much higher load on node1 than on node2.

Expected results:
dlm_recvd should consume far less CPU time. Please explain this disproportion
in dlm_recvd CPU usage between node1 and node2.

Additional info (regarding both clusters):
In our configuration these nodes run as a load-balanced service, so the load
should not differ too much, yet we observe a load of ~10 on node1 and a load
below 1 on node2. We checked on a non-production cluster (services enabled,
but no data being processed): node1 reached a load of 10 (max) while node2
stayed around 0.5, usually less.

We have 4 GFS and 12 ext3 filesystems mounted from SAN storage; the cluster
acts as a load-balanced smtp/pop3/smb/mysql(ndbd) server. Balancing is done
by an external device, and the resource usage of the services is similar on
both nodes.
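For ongoing comparison, here is a minimal sketch of how the check above can be
run against both nodes in one pass. The hostnames are the ones from this
report, passwordless ssh access is assumed, and the bracket trick
("[d]lm_recvd") just keeps egrep from matching its own process:

    for host in pro1a pro2a; do
        echo "== $host =="
        # Cumulative CPU time of rgmanager's processes and dlm_recvd
        ssh "$host" 'ps axo pid,start,time,args | egrep "[c]lurgmgrd|[d]lm_recvd"; date'
    done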
If you aren't using them yet, please install the packages here:
http://people.redhat.com/lhh/packages.html - or the ones from bug #212634.

With that out of the way: there will always be more CPU time consumed by a
given dlm_recvd simply due to the architecture of the DLM. However, it
shouldn't be a large amount. The amount you're seeing is probably related to
bug #212634: in some cases, rgmanager will (obviously incorrectly) leak DLM
locks. On the node mastering those locks, dlm_recvd has to traverse an
ever-longer list of locks, so more and more system time is used.

*** This bug has been marked as a duplicate of 212634 ***
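A rough way to check for this kind of leak is to watch the DLM lock dump grow
on the suspect node. This is a sketch based on the RHEL4 cluster-suite proc
interface (/proc/cluster/services and /proc/cluster/dlm_locks); verify that
those paths and the services output format match your kernel before relying
on it:

    # Dump each DLM lockspace and report the size of its lock list
    # (run as root). A count that only ever grows on the node mastering
    # the locks suggests leaked locks.
    for ls in $(awk -F'"' '/DLM Lock Space/ {print $2}' /proc/cluster/services); do
        echo "$ls" > /proc/cluster/dlm_locks
        echo "$ls: $(wc -l < /proc/cluster/dlm_locks) lines in lock dump"
    done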
The packages in bug #228823 should also fix this.
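To confirm which cluster packages are actually installed on each node before
and after updating, a simple query (package names assumed to follow the
RHEL4 cluster-suite naming) is:

    rpm -q rgmanager dlm dlm-kernel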