Description of problem:

Version-Release number of selected component (if applicable):

How reproducible:

Steps to Reproduce:
1.
2.
3.

Actual results:

Expected results:

Additional info:
Sorry, I hit the Enter button a bit too early ...

Description of problem:
The server hangs (e.g. delayed response times in SSH and all other applications) for almost exactly 5 minutes, then runs fine for 0.25-5 hours, then hangs again. This problem happens only on hardware servers; I am unable to reproduce this behaviour in a Xen VM. I could only reproduce the problem when running directly on hardware with corosync started. With only corosync running I observed only the delayed response times. With pacemaker also running I additionally observed high CPU load (user and system), but there was never an increase in load average.

Version-Release number of selected component (if applicable):
RHEL 6 with the latest patches

How reproducible:
We did several fresh installs on two HP DL370 servers with 2x Intel(R) Xeon(R) CPU X5550, and the problem was reproducible every time.

Steps to Reproduce:
1. Do a fresh install of RHEL 6
2. yum update
3. yum install pacemaker
4. cp /etc/corosync/corosync.conf.example /etc/corosync/corosync.conf
5. /etc/init.d/corosync start

It is not necessary to put additional load on the server; the problem surfaces even on an idle system. The only running processes (apart from the default ones) were my SSH session and corosync.

I added a vmstat upstart job, which collected the system data:

---snip---
cat > /etc/init/vmstat.conf <<EOF
respawn
task
exec /usr/bin/vmstat -t 1 >> /tmp/vmstat.log &
EOF
initctl start vmstat
---snip---

Then I analysed the output with this pipeline, which counts the vmstat probes logged per minute (filtering out the normal counts of 59 and 60):

---snip---
cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 | grep -v 59
---snip---

Actual results:
The pipeline prints the number of probes taken in each abnormal minute. During a hang you see only around 40 probes per minute:

cat /tmp/vmstat.log | awk '{print $18" "$19}' | cut -d: -f 1,2 | sort | uniq -c | grep -v 60 | grep -v 59
40 2011-02-01 09:32
34 2011-02-01 09:33
41 2011-02-01 09:34
40 2011-02-01 09:35
41 2011-02-01 09:36

As you can see, there were only about 40 vmstat probes per minute for five full minutes.

Expected results:
60 (or 59) probes logged every minute, and no hangs.

Additional info:
I currently don't know what triggers this behaviour, but as soon as corosync is stopped the system behaves normally again. I can provide an example pacemaker configuration so you can reproduce the high CPU usage as well, but I don't think that is necessary: the high CPU utilisation is most likely a measuring artefact triggered by this problem.

One important note: a process that is already running at these moments does not seem to be affected; only processes that are woken up are. I discovered this when I tried to measure the latency of file operations in /proc. A Perl script that used sleep() to wait one second between probes also generated only 40 probes per minute. I then replaced the sleep with a while loop that polled gettimeofday() to "sleep", so there was no voluntary context switch. That program generated 60 probes per minute every time, even during the hangs (a sketch of this busy-wait probe follows at the end of this comment). I don't know whether this problem is limited to process scheduling or sits at a lower level.

Workaround:

---snip---
sed -ie "s/\(.*kernel.*x86_64.*\)\$/\1 nohz=off highres=off/" /boot/grub/grub.conf
reboot
---snip---

With these kernel options the problem vanishes.
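To make the last experiment concrete, here is a minimal sketch of that busy-wait probe. This is a hypothetical reconstruction, not my original script, and the output format is made up. Like the sleep()-based variant it logs one timestamp per second, but it spins on gettimeofday() so the process never voluntarily gives up the CPU:

---snip---
#!/usr/bin/perl
# Hypothetical reconstruction of the busy-wait probe described above.
use strict;
use warnings;
use Time::HiRes qw(gettimeofday);
use POSIX qw(strftime);

$| = 1;                          # flush every line immediately
my $last = int(gettimeofday());
while (1) {
    my $now = int(gettimeofday());
    next if $now == $last;       # spin: no voluntary context switch
    $last = $now;
    print strftime("%Y-%m-%d %H:%M:%S\n", localtime($now));
}
---snip---

Running this alongside the sleep()-based variant and counting the lines per minute (as with the vmstat log above) shows the difference: during a hang the sleep variant drops to about 40 lines per minute, while this one stays at 60.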
Using the -p option to corosync may help alleviate this problem; a sketch of one way to set it follows.
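A minimal sketch, assuming the RHEL 6 init script sources /etc/sysconfig/corosync and passes $COROSYNC_OPTIONS to the daemon (verify against your init script before relying on this; -p should keep corosync from setting its realtime scheduling priority):

---snip---
# Assumption: the init script reads COROSYNC_OPTIONS from this file.
# Note this overwrites any existing /etc/sysconfig/corosync.
cat > /etc/sysconfig/corosync <<EOF
COROSYNC_OPTIONS="-p"
EOF
/etc/init.d/corosync restart
---snip---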
Since RHEL 6.1 External Beta has begun, and this bug remains unresolved, it has been rejected as it is not proposed as an exception or blocker. Red Hat invites you to ask your support representative to propose this request, if appropriate and relevant, in the next release of Red Hat Enterprise Linux.
This may be caused by bug #709758; see comment #99. There is no SLA for Bugzilla, so please contact your support representative and open a ticket if the problem persists on RHEL 6 with that bug's patch applied. Your support rep can also help with hotfixes if this problem is causing serious difficulty in deploying RHEL 6.

*** This bug has been marked as a duplicate of bug 709758 ***