| Summary: | tickless kernel - hangs/delays when corosync is started | ||
|---|---|---|---|
| Product: | Red Hat Enterprise Linux 6 | Reporter: | schurzi <redhat> |
| Component: | kernel | Assignee: | Red Hat Kernel Manager <kernel-mgr> |
| Status: | CLOSED DUPLICATE | QA Contact: | Red Hat Kernel QE team <kernel-qe> |
| Severity: | high | Docs Contact: | |
| Priority: | unspecified | ||
| Version: | 6.0 | CC: | bruno.travouillon, leiwang, mburns, prarit, redhat, sdake |
| Target Milestone: | rc | ||
| Target Release: | --- | ||
| Hardware: | x86_64 | ||
| OS: | Linux | ||
| Whiteboard: | |||
| Fixed In Version: | Doc Type: | Bug Fix | |
| Doc Text: | Story Points: | --- | |
| Clone Of: | Environment: | ||
| Last Closed: | 2011-07-22 16:27:56 UTC | Type: | --- |
|
Description
schurzi
2011-02-06 18:04:08 UTC
Sorry, I hit the Enter button a bit too early ...
Description of problem:
The server hangs (e.g. delayed response times in SSH and all other applications) for almost exactly 5 minutes and then runs fine for 0.25 to 5 hours. Then the server hangs again. This problem happens only on physical servers; I am unable to reproduce this behaviour in a Xen VM.
I could only reproduce the problem when running directly on hardware with corosync started. When only corosync was started, I observed only delayed response times. When pacemaker was also running, I additionally observed a high CPU load (user and system), but there was never an increase in the load average.
Version-Release number of selected component (if applicable):
RHEL 6 latest patches
How reproducible:
We did several fresh installs on two HP DL370 servers, each with 2x Intel(R) Xeon(R) CPU X5550, and the problem was reproducible every time.
Steps to Reproduce:
1. Do a fresh install of RHEL6
2. yum update
3. yum install pacemaker
4. cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
5. /etc/init.d/corosync start
No additional load on the server is needed; the problem surfaces even on an idle system. The only running processes (apart from the default processes) were my SSH session and corosync.
I added a vmstat job, which collected the system data.
---snip---
cat > /etc/init/vmstat.conf <<EOF
respawn
task
exec /usr/bin/vmstat -t 1 >> /tmp/vmstat.log &
EOF
initctl start vmstat
---snip---
Then I analysed the output with this script:
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 |grep -v 59
Actual results:
The script returns the number of probes taken per minute. During a hang, only about 40 probes are recorded per minute.
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 |grep -v 59
40 2011-02-01 09:32
34 2011-02-01 09:33
41 2011-02-01 09:34
40 2011-02-01 09:35
41 2011-02-01 09:36
As you can see, there were only about 40 vmstat probes per minute for a full 5 minutes.
Expected results:
60 (or 59) probes logged all the time and no hangs.
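The per-minute count can also be done with a numeric comparison instead of `grep -v`. The following is a minimal sketch, not the reporter's script: the function names are hypothetical, and it assumes (as the awk call above does) that the date and time are fields 18 and 19 of each `vmstat -t` data row.

```python
import re
from collections import Counter

# Matches "YYYY-MM-DD HH:MM" at the start of a "date time" string.
STAMP = re.compile(r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}")

def samples_per_minute(lines):
    """Count vmstat data rows per minute (hypothetical helper).

    Assumes the timestamp occupies whitespace-separated fields 18 and 19,
    mirroring awk's $18 and $19. Header rows and short rows are skipped.
    """
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 19:
            continue
        match = STAMP.match(fields[17] + " " + fields[18])
        if match:
            counts[match.group(0)] += 1
    return counts

def short_minutes(counts, expected=59):
    """Return only the minutes with fewer than `expected` samples."""
    return {minute: n for minute, n in sorted(counts.items()) if n < expected}

if __name__ == "__main__":
    with open("/tmp/vmstat.log") as log:
        for minute, n in short_minutes(samples_per_minute(log)).items():
            print(n, minute)
```

Comparing the count numerically avoids a side effect of the `grep -v 59` filter, which also drops any line whose timestamp contains `59` (for example the minute `09:59`), whatever its sample count.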
Additional info:
I currently don't know what triggers this behaviour, but as soon as corosync is stopped, the system behaves normally. I can provide an example pacemaker configuration so you can reproduce the high CPU usage as well, but I don't think this is necessary, since the high CPU utilisation is most likely a measurement error triggered by this problem.
One important note may be that a process which is running at these moments does not seem to be affected, only processes that are being woken up. I discovered this when I tried to measure the latency of file operations in /proc. A Perl script which used sleep to wait one second also generated only 40 probes a minute. Then I replaced the sleep with a while loop which checked gettimeofday() to "sleep", so there was no voluntary context switch. This program generated 60 probes every time, even during hangs. I don't know if this problem is limited to the scheduling of processes or if it lies at a lower level.
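The comparison between a sleeping and a spinning sampler can be sketched as follows. This is an illustration under assumptions, not the original Perl script: the function names are hypothetical, and `time.monotonic()` stands in for the `gettimeofday()` polling described above.

```python
import time

def probes_with_sleep(duration, interval):
    """Take one probe per interval, giving up the CPU between probes.

    time.sleep() is a voluntary context switch, so each probe depends on
    a timer wake-up being delivered on time.
    """
    deadline = time.monotonic() + duration
    count = 0
    while time.monotonic() < deadline:
        count += 1  # stands in for the real probe
        time.sleep(interval)
    return count

def probes_with_busy_wait(duration, interval):
    """Take one probe per interval, polling the clock instead of sleeping.

    The process never blocks voluntarily, so it does not depend on being
    woken up by a timer.
    """
    deadline = time.monotonic() + duration
    count = 0
    while time.monotonic() < deadline:
        count += 1  # stands in for the real probe
        next_probe = time.monotonic() + interval
        while time.monotonic() < next_probe:
            pass  # spin until the interval has elapsed
    return count

if __name__ == "__main__":
    print("sleep:    ", probes_with_sleep(60, 1))
    print("busy wait:", probes_with_busy_wait(60, 1))
```

On a healthy system both counts should be close to `duration / interval`. A sleep-based count that falls well below the busy-wait count would be consistent with delayed wake-ups rather than a slow clock.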
Workaround:
sed -i -e "s/\(.*kernel.*x86_64.*\)\$/\1 nohz=off highres=off/" /boot/grub/grub.conf
reboot
With these kernel options the problem vanishes.
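The sed one-liner appends the options on every run, so running it twice duplicates them. A minimal sketch of the same edit that only adds missing options is shown below. The function name and the sample input in the usage note are hypothetical; the line match (`kernel` followed by `x86_64`) is taken from the sed pattern.

```python
import re

def add_kernel_args(grub_conf_text, args=("nohz=off", "highres=off")):
    """Append kernel arguments to matching kernel lines (hypothetical helper).

    A line is edited if it contains "kernel" followed later by "x86_64",
    like the sed pattern. Arguments already present are not added again,
    so the function can be applied repeatedly.
    """
    out = []
    for line in grub_conf_text.splitlines():
        if re.search(r"kernel.*x86_64", line):
            present = line.split()
            missing = [arg for arg in args if arg not in present]
            if missing:
                line = line + " " + " ".join(missing)
        out.append(line)
    trailing_newline = "\n" if grub_conf_text.endswith("\n") else ""
    return "\n".join(out) + trailing_newline
```

For example, given a made-up entry such as `kernel /vmlinuz-2.6.32-71.el6.x86_64 ro root=/dev/sda1`, the function returns the same line with ` nohz=off highres=off` appended, and returns that result unchanged when it is applied to it a second time.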
Using the -p option to corosync may help alleviate this problem.
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This may be caused by Bugzilla #709758. See Comment #99.
There is no SLA for bugzilla. Please contact your support representative and open a ticket if the problem persists after using RHEL6 with that bugzilla patch. Your support rep can help with hotfixes if this problem is causing serious difficulty in deploying RHEL6.
*** This bug has been marked as a duplicate of bug 709758 *** |