Bug 596705
Summary: | kernel-rt-2.6.33.4-rt20.18 soft-lock-up when running rteval
---|---
Product: | Red Hat Enterprise MRG
Reporter: | David Sommerseth <davids>
Component: | realtime-kernel
Assignee: | Red Hat Real Time Maintenance <rt-maint>
Status: | CLOSED DUPLICATE
QA Contact: | David Sommerseth <davids>
Severity: | high
Priority: | low
Version: | Development
CC: | bhu, lgoncalv, ovasik, williams
Hardware: | x86_64
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2010-05-27 13:13:24 UTC

*** This bug has been marked as a duplicate of bug 584153 ***

Created attachment 417194
Extract of the timed sysrq polling

Description of problem:
When running rteval, the system enters a kind of soft-lock-up state after a while. One run needed ~11 hours, another ~2 hours.

The kernel still responds to ping, but the serial console locked up while it was being refreshed. It managed to print "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" before going silent. All SSH connections were dead. The server appeared to accept a new connection (the client did not time out), but it never responded with any SSH protocol negotiation. I don't recall now whether I boosted sshd to SCHED_FIFO with priority 75, but I believe I did.

In the last test, I ran rteval in one shell and this command in another:

[root@hp-xw8400-01 ~]# (while /bin/true; do date; echo t > /proc/sysrq-trigger ; echo "###########"; dmesg; echo w > /proc/sysrq-trigger; echo "########### ***"; dmesg; sleep 900; done) > sysrq-log.txt

I've attached an extract of sysrq-log.txt containing the last run, taken just a few minutes before the kernel locked up. Nothing in the log files indicates trouble, and no backtraces were printed to the console.

Output from rteval:
------------------------------------------------------------------------
kcompile: ready to run
rteval run on 2.6.33.4-rt20.18.el5rt started at Wed May 26 15:44:22 2010
started 2 loads on 8 cores
Run duration: 50400 seconds
starting cyclictest
cyclictest: running in SMP mode
cyclictest: starting with cmd: cyclictest -i100 -qm -d0 -h 2000 -p95 --smp
sending start event to all loads
waiting for duration (50400.000000)
kcompile: starting loop (jobs: 16)
hackbench: starting loop (jobs: 40)
rteval time remaining: 0 days, 13 hours, 49 minutes, 59 seconds
rteval time remaining: 0 days, 13 hours, 39 minutes, 58 seconds
rteval time remaining: 0 days, 13 hours, 29 minutes, 58 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 19 minutes, 57 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 9 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 59 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 49 minutes, 55 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 39 minutes, 54 seconds
rteval time remaining: 0 days, 12 hours, 29 minutes, 54 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 19 minutes, 53 seconds
rteval time remaining: 0 days, 12 hours, 9 minutes, 52 seconds
rteval time remaining: 0 days, 11 hours, 59 minutes, 51 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 11 hours, 49 minutes, 51 seconds
------------------------------------------------------------------------

The last entry in /var/log/messages was logged at 17:09:20 (ntpd[7666]: synchronized to 10.16.71.254, stratum 2). The sysrq logging loop last ran at 17:58:15.

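As a cross-check on the timeline, the wall-clock time of any rteval progress line can be derived from the start time (15:44:22) and the run duration (50400 seconds) shown in the rteval output. The following is only a rough sketch: the helper name `progress_time` is made up for illustration, and it assumes the run does not cross midnight.

```shell
#!/bin/sh
# Values copied from the rteval output in this report.
run_duration=50400                           # "Run duration: 50400 seconds"
start_secs=$(( 15 * 3600 + 44 * 60 + 22 ))   # run started at 15:44:22

# progress_time H M S  (hypothetical helper, not part of rteval)
# Prints the wall-clock time at which rteval reported H hours, M minutes,
# S seconds remaining. Pass plain integers without leading zeros.
# Assumes start and progress line fall on the same day.
progress_time() {
    remaining=$(( $1 * 3600 + $2 * 60 + $3 ))
    elapsed=$(( run_duration - remaining ))
    t=$(( start_secs + elapsed ))
    printf '%02d:%02d:%02d\n' $(( t / 3600 )) $(( t % 3600 / 60 )) $(( t % 60 ))
}

# Last progress line in the output: 11 hours, 49 minutes, 51 seconds remaining.
progress_time 11 49 51    # prints 17:54:31
```

Under these assumptions the last progress line was printed at about 17:54:31, a few minutes before the final sysrq loop iteration at 17:58:15.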
The /var/log/maillog file shows that the system load was high, but not as high as it had been earlier:
------------------------------------------------------------------------
May 26 17:53:20 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 607
May 26 17:53:35 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 611
May 26 17:53:51 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 546
May 26 17:54:06 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 430
May 26 17:54:21 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 583
May 26 17:54:36 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 598
May 26 17:54:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 813
May 26 17:55:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 753
May 26 17:55:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 837
May 26 17:55:37 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 814
May 26 17:55:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 658
May 26 17:56:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 545
May 26 17:56:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 434
May 26 17:56:38 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 344
May 26 17:56:53 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 397
May 26 17:57:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 605
May 26 17:57:24 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 677
May 26 17:57:39 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 854
May 26 17:57:54 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 774
May 26 17:58:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 610
------------------------------------------------------------------------

crond did nothing of interest as far as we know:
------------------------------------------------------------------------
May 26 15:30:16 hp-xw8400-01 anacron[7750]: Normal exit (0 jobs run)
May 26 16:01:03 hp-xw8400-01 crond[18957]: (root) CMD (run-parts /etc/cron.hourly)
May 26 17:01:02 hp-xw8400-01 crond[4840]: (root) CMD (run-parts /etc/cron.hourly)
------------------------------------------------------------------------
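To compare the load just before the lock-up with earlier periods, the sendmail "rejecting connections" entries in the maillog can be summarized with awk. This is a minimal sketch, assuming the message format is exactly as in the maillog excerpt in this report (load average as the last field); the function name `load_stats` is invented here for illustration.

```shell
#!/bin/sh
# load_stats (hypothetical helper): reads maillog lines on stdin and prints
# the number of sendmail load-average entries plus their minimum and maximum.
# Assumes the load value is the last whitespace-separated field of the line.
load_stats() {
    awk '/rejecting connections on daemon MTA: load average:/ {
        v = $NF + 0
        if (n == 0 || v < min) min = v
        if (n == 0 || v > max) max = v
        n++
    }
    END { if (n) printf "samples=%d min=%d max=%d\n", n, min, max }'
}

# Demo on three lines taken from the excerpt in this report.
load_stats <<'EOF'
May 26 17:53:20 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 607
May 26 17:56:38 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 344
May 26 17:57:39 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 854
EOF
# prints: samples=3 min=344 max=854
```

On the affected machine the same function could be fed the whole file, for example `load_stats < /var/log/maillog`, to see how the final minutes compare with the earlier peak.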