Bug 675568 - tickless kernel - hangs/delays when corosync is started
Summary: tickless kernel - hangs/delays when corosync is started
Keywords:
Status: CLOSED DUPLICATE of bug 709758
Alias: None
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.0
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: rc
Target Release: ---
Assignee: Red Hat Kernel Manager
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
Reported: 2011-02-06 18:04 UTC by schurzi
Modified: 2011-07-22 16:27 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-07-22 16:27:56 UTC
Target Upstream Version:



Description schurzi 2011-02-06 18:04:08 UTC
Description of problem:


Version-Release number of selected component (if applicable):


How reproducible:


Steps to Reproduce:
1.
2.
3.
  
Actual results:


Expected results:


Additional info:

Comment 1 schurzi 2011-02-06 18:05:28 UTC
Sorry, I hit the Enter button a bit too early ...

Description of problem:
The server hangs (e.g. delayed response times in SSH and all other applications) for almost exactly 5 minutes and then runs fine for 0.25 to 5 hours. Then the server hangs again. This problem happens only on physical servers; I am unable to reproduce this behaviour with a Xen VM.

I could only reproduce the problem when running directly on hardware with the corosync software started. When only corosync was started, I could only observe delayed response times. When pacemaker was running, I could also observe a high CPU load (user and system), but there was never an increase in the load average.

Version-Release number of selected component (if applicable):
RHEL 6 latest patches

How reproducible:
We did several fresh installs on two HP DL370 servers with 2x Intel(R) Xeon(R) CPU X5550, and this was reproducible every time.

Steps to Reproduce:
1. Do a fresh install of RHEL6
2. yum update
3. yum install pacemaker
4. cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
5. /etc/init.d/corosync start

It is not necessary to put additional load on the server; this problem surfaces even on an idle system. The only running processes (except the default processes) were my SSH session and corosync.

I added a vmstat job, which collected the system data.

---snip---
cat > /etc/init/vmstat.conf <<EOF
respawn
task
exec /usr/bin/vmstat -t 1 >> /tmp/vmstat.log &
EOF

initctl start vmstat
---snip---

Then I analysed the output with this script
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 |grep -v 59

Actual results:
The script returned the number of probes taken in each minute. During a period with hangs you can see only about 40 probes per minute.
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 |grep -v 59
40 2011-02-01 09:32
34 2011-02-01 09:33
41 2011-02-01 09:34
40 2011-02-01 09:35
41 2011-02-01 09:36

As you can see, there were only about 40 vmstat probes per minute (34 to 41) for a full 5 minutes.
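
The per-minute count can also be done in a single awk pass that looks for the timestamp by pattern instead of by fixed field numbers. This is only a sketch: the function name short_minutes and the default threshold of 59 are assumptions for illustration, and it assumes every data line contains a YYYY-MM-DD field followed directly by an HH:MM:SS field, as in the output above.

```shell
# Print "count minute" for every minute that has fewer than `min` samples.
# $1 = vmstat log file, $2 = minimum expected samples per minute (default 59).
# Header lines are skipped because they contain no date field.
short_minutes() {
    awk -v min="${2:-59}" '
        {
            for (i = 1; i < NF; i++) {
                if ($i ~ /^[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]$/) {
                    # Key is "date HH:MM"; seconds are cut off.
                    count[$i " " substr($(i + 1), 1, 5)]++
                    break
                }
            }
        }
        END {
            for (m in count)
                if (count[m] < min)
                    print count[m], m
        }
    ' "$1" | sort -k2,3
}
```

Called as `short_minutes /tmp/vmstat.log`, it lists only the minutes with missing probes, sorted by time.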

Expected results:
60 (or 59) probes logged all the time and no hangs.

Additional info:
I currently don't know what triggers this behaviour, but as soon as corosync is stopped, the system behaves normally. I can provide an example pacemaker configuration so you can reproduce the high CPU usage as well, but I don't think this is necessary, since the high CPU utilisation is most likely a measuring error triggered by this problem.

One important note may be that a process which is running at these moments does not seem to be affected, only processes that are woken up. I discovered this when I tried to measure the latency of file operations in /proc. A Perl script which used sleep to wait one second also generated only 40 probes a minute. Then I replaced the sleep with a while loop which checked gettimeofday() to "sleep", so there was no voluntary context switch. This program generated 60 probes every time, even during hangs. I don't know if this problem is limited to the scheduling of processes or if it is on a lower level.
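
The difference between the two probe variants can be sketched in shell. This is a hypothetical reconstruction, not the Perl script from the report: the function names probe_sleep and probe_spin are made up, and the busy-wait polls the shell's SECONDS counter (available in bash and ksh) instead of calling gettimeofday().

```shell
# Variant 1: sleeps between samples, so each sample depends on a timer wakeup.
# $1 = number of samples to take; prints one line (elapsed seconds) per sample.
probe_sleep() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "$SECONDS"
        sleep 1
        i=$((i + 1))
    done
}

# Variant 2: spins until the SECONDS counter advances and never sleeps,
# so the process stays runnable the whole time.
probe_spin() {
    n=$1
    i=0
    while [ "$i" -lt "$n" ]; do
        echo "$SECONDS"
        next=$((SECONDS + 1))
        while [ "$SECONDS" -lt "$next" ]; do :; done
        i=$((i + 1))
    done
}
```

Running both variants for the same length of time during a hang and comparing the number of lines each one wrote should show whether only the sleeping variant loses samples.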

Workaround:
sed -i -e "s/\(.*kernel.*x86_64.*\)\$/\1 nohz=off highres=off/" /boot/grub/grub.conf
reboot

With these kernel options the problem vanishes.
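
After the reboot it may be worth confirming that both options actually reached the running kernel. A minimal sketch, assuming /proc/cmdline is readable; has_param is a hypothetical helper name, not an existing tool:

```shell
# Return success if the kernel command line in $2 contains the parameter $1
# as a whole word.
has_param() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Read the command line the running kernel was booted with.
cmdline=$(cat /proc/cmdline 2>/dev/null)
for p in nohz=off highres=off; do
    if has_param "$p" "$cmdline"; then
        echo "$p: active"
    else
        echo "$p: missing"
    fi
done
```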

Comment 3 Steven Dake 2011-03-16 16:03:16 UTC
Using the -p option to corosync may help alleviate this problem.

Comment 4 RHEL Program Management 2011-04-04 02:49:09 UTC
Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 5 Steven Dake 2011-07-22 16:27:56 UTC
This may be caused by Bugzilla #709758. See Comment #99.

There is no SLA for bugzilla.  Please contact your support representative and open a ticket if the problem persists after using RHEL6 with that bugzilla patch.  Your support rep can help with hotfixes if this problem is causing serious difficulty in deploying RHEL6.

*** This bug has been marked as a duplicate of bug 709758 ***

