Bug 675568

Summary:	tickless kernel - hangs/delays when corosync is started
Product:	Red Hat Enterprise Linux 6	Reporter:	schurzi <redhat>
Component:	kernel	Assignee:	Red Hat Kernel Manager <kernel-mgr>
Status:	CLOSED DUPLICATE	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	high	Docs Contact:
Priority:	unspecified
Version:	6.0	CC:	bruno.travouillon, leiwang, mburns, prarit, redhat, sdake
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Linux
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-07-22 16:27:56 UTC	Type:	---
Regression:	---	Mount Type:	---
Documentation:	---	CRM:
Verified Versions:		Category:	---
oVirt Team:	---	RHEL 7.3 requirements from Atomic Host:
Cloudforms Team:	---	Target Upstream Version:
Embargoed:

Description schurzi 2011-02-06 18:04:08 UTC

Comment 1 schurzi 2011-02-06 18:05:28 UTC

Sorry, I hit the Enter button a bit too early ...

Description of problem:
The server hangs (e.g. delayed response times in SSH and all other applications) for almost exactly 5 minutes and then runs fine for 0.25 to 5 hours. Then the server hangs again. This problem happens only on physical servers; I am unable to reproduce this behaviour in a Xen VM.

I could only reproduce the problem when running directly on hardware with corosync started. When only corosync was started, I could only observe delayed response times. When pacemaker was also running, I could additionally observe high CPU load (user and system), but there was never an increase in the load average.

Version-Release number of selected component (if applicable):
RHEL 6 latest patches

How reproducible:
We did several fresh installs on two HP DL370 servers with 2x Intel(R) Xeon(R) CPU X5550, and the problem was reproducible every time.

Steps to Reproduce:
1. Do a fresh install of RHEL6
2. yum update
3. yum install pacemaker
4. cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
5. /etc/init.d/corosync start

No additional load on the server is needed; this problem surfaces even on an idle system. The only running processes (except the default processes) were my SSH session and corosync.

I added a vmstat script, which collected the system data.

---snip---
cat > /etc/init/vmstat.conf <<EOF
respawn
task
exec /usr/bin/vmstat -t 1 >> /tmp/vmstat.log &
EOF

initctl start vmstat
---snip---

Then I analysed the output with this script
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 |grep -v 59

Actual results:
The script returns the number of probes taken per minute. During a period with hangs you see only about 40 probes per minute.
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 |grep -v 59
40 2011-02-01 09:32
34 2011-02-01 09:33
41 2011-02-01 09:34
40 2011-02-01 09:35
41 2011-02-01 09:36

As you can see, there were only about 40 vmstat probes per minute for a full 5 minutes.
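
The per-minute counting done by the awk pipeline can also be sketched in Python. This is only an illustration, not the script used in this report: it assumes Python 3 and that every data row of `vmstat -t` carries a timestamp of the form shown in the output above (`2011-02-01 09:32:...`). The names `samples_per_minute` and `short_minutes` are made up for this sketch. Unlike `grep -v 60 | grep -v 59`, it compares the count as a number, so a minute is not hidden just because "59" or "60" appears somewhere in the line.

```python
import re
from collections import Counter

# Assumed timestamp layout: "YYYY-MM-DD HH:MM:SS" somewhere in each data row.
STAMP = re.compile(r"(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}):\d{2}")


def samples_per_minute(lines):
    """Count vmstat rows per 'YYYY-MM-DD HH:MM' bucket; header rows are skipped."""
    counts = Counter()
    for line in lines:
        match = STAMP.search(line)
        if match:
            counts[f"{match.group(1)} {match.group(2)}"] += 1
    return counts


def short_minutes(counts, expected=59):
    """Return (minute, count) pairs with fewer than `expected` rows, in time order."""
    return [(minute, n) for minute, n in sorted(counts.items()) if n < expected]
```

It could be used as `short_minutes(samples_per_minute(open("/tmp/vmstat.log")))`. The first and last minute of a log are usually incomplete and will show up as well.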

Expected results:
60 (or 59) probes logged all the time and no hangs.

Additional info:
I currently don't know what triggers this behaviour, but as soon as corosync is stopped, the system behaves normally. I can provide an example pacemaker configuration so you can reproduce the high CPU usage as well, but I don't think this is necessary, since the high CPU utilisation is most likely a measurement error triggered by this problem.

One important note may be that a process which is running at these moments does not seem to be affected, only processes that are being woken up. I discovered this when I tried to measure the latency of file operations in /proc. A Perl script which used sleep to wait one second also generated only 40 probes a minute. Then I replaced the sleep with a while loop which checked gettimeofday() to "sleep", so there was no voluntary context switch. This program generated 60 probes every time, even during hangs. I don't know if this problem is limited to the scheduling of processes or if it sits at a lower level.
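
The two probe styles compared above can be sketched roughly as follows. The original test was a Perl script that is not attached to this report, so this Python 3 version is only an approximation of the idea; the function names and the use of `time.time()` as a stand-in for gettimeofday() are assumptions of this sketch.

```python
import time


def sleep_probe(duration, interval=1.0):
    """Take one probe per interval, blocking in sleep() between probes."""
    probes = []
    end = time.time() + duration
    while time.time() < end:
        probes.append(time.time())
        # Voluntary context switch: the process depends on a timer wakeup.
        time.sleep(interval)
    return probes


def busy_probe(duration, interval=1.0):
    """Take one probe per interval by polling the clock, never blocking."""
    probes = []
    start = time.time()
    end = start + duration
    next_probe = start
    while True:
        now = time.time()
        if now >= end:
            break
        if now >= next_probe:
            probes.append(now)
            next_probe += interval
    return probes
```

On a healthy system, `len(sleep_probe(60))` and `len(busy_probe(60))` should both be close to 60. With the behaviour described here, the sleep-based variant would be expected to fall to about 40 during a hang while the busy-wait variant stays at 60.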

Workaround:
sed -ie "s/\(.*kernel.*x86_64.*\)\$/\1 nohz=off highres=off/" /boot/grub/grub.conf
reboot

With these kernel options the problem vanishes.
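
After the reboot it may be worth confirming that both options really reached the running kernel. A minimal Python 3 sketch, assuming the command line is read from /proc/cmdline; the helper name `missing_boot_options` is made up for this example:

```python
def missing_boot_options(cmdline, wanted=("nohz=off", "highres=off")):
    """Return the wanted options that do not appear on the kernel command line."""
    present = set(cmdline.split())
    return [option for option in wanted if option not in present]
```

For example, `missing_boot_options(open("/proc/cmdline").read())` returns an empty list when both options are active.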

Comment 3 Steven Dake 2011-03-16 16:03:16 UTC

Using the -p option to corosync may help alleviate this problem.

Comment 4 RHEL Program Management 2011-04-04 02:49:09 UTC

Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Steven Dake 2011-07-22 16:27:56 UTC

This may be caused by Bugzilla #709758. See Comment #99.

There is no SLA for bugzilla.  Please contact your support representative and open a ticket if the problem persists after using RHEL6 with that bugzilla patch.  Your support rep can help with hotfixes if this problem is causing serious difficulty in deploying RHEL6.

*** This bug has been marked as a duplicate of bug 709758 ***