Bug 616095 - corosync process eats 90+% of CPU, node fenced during add/remove test
Summary: corosync process eats 90+% of CPU, node fenced during add/remove test
Keywords:
Status: CLOSED DUPLICATE of bug 580741
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: corosync
Version: 6.0
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Angus Salkeld
QA Contact: Cluster QE
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-07-19 16:24 UTC by Dean Jansa
Modified: 2010-07-22 19:56 UTC
CC List: 2 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-07-22 19:38:45 UTC
Target Upstream Version:
Embargoed:



Description Dean Jansa 2010-07-19 16:24:03 UTC
Description of problem:

Running a node removal/addition test in a loop will trigger the corosync process on one of the cluster nodes to consume 90+% of the CPU.  This in turn causes the cluster to fence that node.  The node affected is not one of the "new" nodes being added, but rather a node that is part of the core cluster the entire time.
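
Not part of the original report, but a minimal sketch of one way to watch for the runaway corosync process while the loop runs; the 90% threshold, 5-second poll interval, and log path are arbitrary choices:

#!/bin/bash
# Sketch: log whenever the local corosync process exceeds 90% CPU during the test.
while true; do
    cpu=$(ps -C corosync -o %cpu= | awk 'NR==1 {printf "%d", $1}')
    if [ -n "$cpu" ] && [ "$cpu" -ge 90 ]; then
        echo "$(date): corosync at ${cpu}% CPU" >> /tmp/corosync-cpu.log
    fi
    sleep 5
done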

Version-Release number of selected component (if applicable):
RHEL6.0-20100707.4-Server
corosync-1.2.3-9.el6.x86_64

How reproducible:

25-100 iterations of remove/add test case 'pruner' from the sts rpm.
/usr/test/sts-rhel6.0/ccs/bin/pruner -R <resource file/cluster.conf>


Steps to Reproduce:
1. Run /usr/test/sts-rhel6.0/ccs/bin/pruner -R <resource file/cluster.conf> in a loop.

Or:

1. See https://bugzilla.redhat.com/show_bug.cgi?id=532730 and follow the directions to remove a node.
2. Add the node back.
3. Repeat.
  
Actual results:

One node is fenced.



Expected results:

Nodes are removed and added successfully for as long as we care to loop the test.

Additional info:

Comment 2 Steven Dake 2010-07-19 16:44:31 UTC
Angus,

Can you please verify whether this is a dup of https://bugzilla.redhat.com/show_bug.cgi?id=580741, i.e., run with -9 (the build in this bug report), then run with -11.

The symptoms sound like they should be resolved with the -11 build.
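
A minimal sketch of how that side-by-side run might look on a test node; the package NVRs are the ones named in this bug, while the local RPM files, the wildcard covering the corosync subpackages, and the config path are assumptions:

#!/bin/bash
# Sketch: compare the -9 and -11 corosync builds on a test node.
CLUSTER_CONF=/path/to/cluster.conf   # hypothetical path; use the resource file from the reproducer

rpm -q corosync                      # expect corosync-1.2.3-9.el6 for the failing runs
/usr/test/sts-rhel6.0/ccs/bin/pruner -R "$CLUSTER_CONF"

# Assumes the corosync-1.2.3-11.el6 RPMs have already been downloaded locally.
rpm -Uvh corosync*-1.2.3-11.el6.x86_64.rpm
/usr/test/sts-rhel6.0/ccs/bin/pruner -R "$CLUSTER_CONF"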

Comment 3 RHEL Program Management 2010-07-19 16:57:38 UTC
This issue has been proposed when we are only considering blocker
issues in the current Red Hat Enterprise Linux release.

** If you would still like this issue considered for the current
release, ask your support representative to file as a blocker on
your behalf. Otherwise ask that it be considered for the next
Red Hat Enterprise Linux release. **

Comment 6 Steven Dake 2010-07-19 17:50:05 UTC
Sly,

I have not yet set the flags as requested in your earlier email.  Angus will
triage, and when that is finished we will either close this as a dup (the most
likely scenario), target it for 6.1, or propose it as a 6.0 blocker, depending
on how serious the defect is.  We should have an answer in 1-2 days.

Thanks
-steve

Comment 10 Angus Salkeld 2010-07-20 00:24:30 UTC
Hi guys, I keep getting the error below. Any ideas on
getting this test to run? Do I need anything special in the
config file (I am using the same one from a laryngitis test run)?

$ rpm -q corosync cman
corosync-1.2.3-9.el6.x86_64
cman-3.0.12-9.el6.x86_64

$ ./ccs/bin/pruner -R /home/asalkeld/virty4.xml 
ricci (pid  1858) is running...
ricci (pid  1806) is running...
ricci (pid  1828) is running...
ricci (pid  1811) is running...
Grabbing running cluster.conf from r4
r4:/etc/cluster/cluster.conf  -> /tmp/cluster.conf.orig.pruner.
Removing r4 from cluster
Stopping cluster: 
   Leaving fence domain... [  OK  ]
   Stopping gfs_controld... [  OK  ]
   Stopping dlm_controld... [  OK  ]
   Stopping fenced... [  OK  ]
   Stopping cman... [  OK  ]
   Waiting for corosync to shutdown:[  OK  ]
   Unloading kernel modules... [  OK  ]
   Unmounting configfs... [  OK  ]
Removing r4 from /tmp/cluster.conf.pruner.30035
1 nodes removed from /tmp/cluster.conf.pruner.30035
Bumping config_version to 3
Distributing /tmp/cluster.conf.pruner.30035 to one remaning node
/tmp/cluster.conf.pruner.30035  -> r1:/etc/cluster/cluster.conf
Updating cman's view of the cluster
Checking node count, vote and quorum on r1
New Node count:  3
r1: Node count: 3
New Expected votes: 3
r1: Expected votes: 3
New Quorum: 2
r1: Quorum: 3
Quorum votes on the running cluster does not match the
expected value!
Expected Quorum: 2
Cman has: 3

---------------
$ cman_tool status
Version: 6.2.0
Config Version: 3
Cluster Name: virty
Cluster Id: 13185
Cluster Member: Yes
Cluster Generation: 19996
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 3  
Active subsystems: 1
Flags: 
Ports Bound: 0  
Node name: r2
Node ID: 2
Multicast addresses: 239.192.51.180 
Node addresses: 192.168.100.92
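
For context on the mismatch pruner reports: with the usual cman formula quorum = expected_votes / 2 + 1 (integer division), three single-vote nodes should give a quorum of 2, yet cman is still reporting 3. A minimal sketch of that check, assuming the standard formula; the helper below is not part of the test suite:

#!/bin/bash
# Sketch: what quorum cman should report for a given expected-vote count,
# assuming quorum = expected_votes / 2 + 1 (integer division).
expected_votes=3
echo "expected quorum: $(( expected_votes / 2 + 1 ))"    # prints 2 for 3 votes

# Compare against what the running cluster actually reports.
cman_tool status | grep -E '^(Expected votes|Quorum):'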

Comment 11 Steven Dake 2010-07-20 07:15:41 UTC
Chrissie indicated comment #10 may be explained by Bug 606989.

Comment 13 Angus Salkeld 2010-07-21 00:59:24 UTC
Thanks guys, I am now using

corosync-1.2.3-9.el6.x86_64
cman-3.0.12-14.el6.x86_64

and the quorum problem seems to be fixed. Only problem
is that I can't reproduce the bug. I have run through 200
iterations without seeing anything unusual.

Dean, does the test actually fail (return != 0)?

My little script does this (and doesn't fail); the output looks good too:
#!/bin/bash
# Abort on the first non-zero pruner exit so a failing iteration stops the loop.
set -e
for i in {1..200}
do
	echo "=================================="
	echo "=> Iteration $i"
	./ccs/bin/pruner -R /home/asalkeld/virty4.xml
done

Comment 14 Dean Jansa 2010-07-22 15:11:35 UTC
With the fix in -13 I don't see this test fail either. Looks like we can close this as a duplicate of Bug 606989.

Comment 15 Steven Dake 2010-07-22 19:38:45 UTC

*** This bug has been marked as a duplicate of bug 606989 ***

Comment 16 Dean Jansa 2010-07-22 19:56:11 UTC

*** This bug has been marked as a duplicate of bug 580741 ***

