Bug 596705 - kernel-rt-2.6.33.4-rt20.18 soft-lock-up when running rteval
Status: CLOSED DUPLICATE of bug 584153
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: Development
Hardware: x86_64 Linux
Priority: low, Severity: high
Assigned To: Red Hat Real Time Maintenance
QA Contact: David Sommerseth
Depends On:
Blocks:
Reported: 2010-05-27 06:51 EDT by David Sommerseth
Modified: 2016-05-22 19:30 EDT (History)
Doc Type: Bug Fix
Last Closed: 2010-05-27 09:13:24 EDT

Attachments
Extract of the timed sysrq polling (241.23 KB, text/plain)
2010-05-27 06:51 EDT, David Sommerseth

Description David Sommerseth 2010-05-27 06:51:07 EDT
Created attachment 417194
Extract of the timed sysrq polling

Description of problem:
When running rteval, the system enters a kind of soft-lockup state after a while.  One run took ~11 hours to hit it, another ~2 hours.

The kernel still responds to ping, but the serial console locked up while refreshing.  It managed to print "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" before going silent.  All SSH connections were dead.  The machine more or less accepted a new connection (the client did not time out), but it never responded with the SSH protocol negotiation.  I don't recall whether I had boosted sshd to SCHED_FIFO with priority 75, but I believe I did.

In the last test, I ran rteval in one shell and in another shell I ran this command:

[root@hp-xw8400-01 ~]# (while /bin/true; do date; echo t > /proc/sysrq-trigger ; echo "###########"; dmesg; echo w > /proc/sysrq-trigger; echo "########### ***"; dmesg; sleep 900; done) > sysrq-log.txt
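
For readability, the one-liner above can be sketched as a small script. This is only an illustration of the same loop: the names sysrq_snapshot and poll_sysrq, and the SYSRQ_TRIGGER and DMESG_CMD overrides, are assumptions added here so the function can be tried without root. They are not part of the command that was actually run.

```shell
# Hypothetical overrides (not in the original command) so the snapshot
# function can be exercised against a scratch file instead of /proc.
SYSRQ_TRIGGER="${SYSRQ_TRIGGER:-/proc/sysrq-trigger}"
DMESG_CMD="${DMESG_CMD:-dmesg}"

# Take one snapshot: timestamp, task dump, then blocked-task dump.
sysrq_snapshot() {
    date
    echo t > "$SYSRQ_TRIGGER"    # 't': dump the state of all tasks to the kernel log
    echo "###########"
    $DMESG_CMD
    echo w > "$SYSRQ_TRIGGER"    # 'w': dump tasks in uninterruptible (blocked) state
    echo "########### ***"
    $DMESG_CMD
}

# Repeat the snapshot every $1 seconds (900 in the report) until interrupted.
poll_sysrq() {
    interval="${1:-900}"
    while true; do
        sysrq_snapshot
        sleep "$interval"
    done
}

# Usage (as root): poll_sysrq 900 > sysrq-log.txt
```

Because dmesg prints the whole ring buffer each time, consecutive snapshots in the log overlap. That matches the behaviour of the original command.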

I've attached an extract of the sysrq-log.txt which contains the last run just a few minutes before the kernel locked up.

Nothing in the log files indicates trouble, and no backtraces were printed to the console.

Output from rteval:
------------------------------------------------------------------------
kcompile: ready to run
rteval run on 2.6.33.4-rt20.18.el5rt started at Wed May 26 15:44:22 2010
started 2 loads on 8 cores 
Run duration: 50400 seconds
starting cyclictest
cyclictest: running in SMP mode
cyclictest: starting with cmd: cyclictest -i100 -qm -d0 -h 2000 -p95 --smp
sending start event to all loads
waiting for duration (50400.000000)
kcompile: starting loop (jobs: 16)
hackbench: starting loop (jobs: 40)
rteval time remaining: 0 days, 13 hours, 49 minutes, 59 seconds
rteval time remaining: 0 days, 13 hours, 39 minutes, 58 seconds
rteval time remaining: 0 days, 13 hours, 29 minutes, 58 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 19 minutes, 57 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 9 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 59 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 49 minutes, 55 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 39 minutes, 54 seconds
rteval time remaining: 0 days, 12 hours, 29 minutes, 54 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 19 minutes, 53 seconds
rteval time remaining: 0 days, 12 hours, 9 minutes, 52 seconds
rteval time remaining: 0 days, 11 hours, 59 minutes, 51 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 11 hours, 49 minutes, 51 seconds
------------------------------------------------------------------------
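
To relate the last "time remaining" line to wall-clock time, the elapsed run time can be worked out from the configured duration (50400 seconds). The helper below is a rough sketch. elapsed_from_remaining is a hypothetical name, not an rteval feature, and it assumes the line carries exactly four numbers (days, hours, minutes, seconds).

```shell
# Hypothetical helper: elapsed seconds = run duration - time remaining.
# $1 = run duration in seconds, $2 = an rteval "time remaining" line.
elapsed_from_remaining() {
    duration="$1"
    # Replace every run of non-digits with a space, leaving "D H M S",
    # and load those four numbers into $1..$4.
    set -- $(printf '%s\n' "$2" | sed 's/[^0-9][^0-9]*/ /g')
    echo $(( duration - ($1 * 86400 + $2 * 3600 + $3 * 60 + $4) ))
}

elapsed=$(elapsed_from_remaining 50400 \
    "rteval time remaining: 0 days, 11 hours, 49 minutes, 51 seconds")
printf '%d s elapsed (%dh %dm %ds)\n' \
    "$elapsed" $((elapsed / 3600)) $((elapsed % 3600 / 60)) $((elapsed % 60))
```

For the last line in the output above this gives 7809 seconds, about 2 hours 10 minutes into the run. With the 15:44:22 start that is roughly 17:54, a few minutes before the final sysrq poll.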

The last entry in /var/log/messages was logged at 17:09:20 (ntpd[7666]: synchronized to 10.16.71.254, stratum 2).  The sysrq polling loop last ran at 17:58:15.

The /var/log/maillog file shows that the load on the system was high, but not as high as it had been earlier:
------------------------------------------------------------------------
May 26 17:53:20 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 607
May 26 17:53:35 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 611
May 26 17:53:51 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 546
May 26 17:54:06 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 430
May 26 17:54:21 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 583
May 26 17:54:36 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 598
May 26 17:54:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 813
May 26 17:55:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 753
May 26 17:55:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 837
May 26 17:55:37 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 814
May 26 17:55:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 658
May 26 17:56:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 545
May 26 17:56:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 434
May 26 17:56:38 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 344
May 26 17:56:53 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 397
May 26 17:57:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 605
May 26 17:57:24 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 677
May 26 17:57:39 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 854
May 26 17:57:54 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 774
May 26 17:58:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 610
------------------------------------------------------------------------
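
A quick way to summarise these sendmail lines is to pull out the last field of each one. The following is a minimal sketch. load_stats is a hypothetical helper name, and it assumes the load value is always the final field, as in the excerpt above.

```shell
# Hypothetical helper: read maillog lines on stdin and print the number of
# samples plus the minimum and maximum load average reported by sendmail.
load_stats() {
    awk '/load average:/ {
             v = $NF + 0                      # last field is the load value
             if (n == 0 || v < min) min = v
             if (n == 0 || v > max) max = v
             n++
         }
         END { if (n) printf "samples=%d min=%d max=%d\n", n, min, max }'
}

# Usage: load_stats < /var/log/maillog
```

Run over the 20 lines quoted above, this reports a minimum of 344 and a maximum of 854 for the last five minutes before the lock-up.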

crond did nothing of interest, as far as we know:
------------------------------------------------------------------------
May 26 15:30:16 hp-xw8400-01 anacron[7750]: Normal exit (0 jobs run)
May 26 16:01:03 hp-xw8400-01 crond[18957]: (root) CMD (run-parts /etc/cron.hourly)
May 26 17:01:02 hp-xw8400-01 crond[4840]: (root) CMD (run-parts /etc/cron.hourly)
------------------------------------------------------------------------
Comment 1 David Sommerseth 2010-05-27 09:13:24 EDT

*** This bug has been marked as a duplicate of bug 584153 ***
