Bug 596705

Summary: kernel-rt-2.6.33.4-rt20.18 soft-lock-up when running rteval
Product: Red Hat Enterprise MRG
Component: realtime-kernel
Version: Development
Reporter: David Sommerseth <davids>
Assignee: Red Hat Real Time Maintenance <rt-maint>
QA Contact: David Sommerseth <davids>
CC: bhu, lgoncalv, ovasik, williams
Status: CLOSED DUPLICATE
Severity: high
Priority: low
Hardware: x86_64
OS: Linux
Doc Type: Bug Fix
Last Closed: 2010-05-27 13:13:24 UTC

Attachments:
Extract of the timed sysrq polling

Description David Sommerseth 2010-05-27 10:51:07 UTC
Created attachment 417194 [details]
Extract of the timed sysrq polling

Description of problem:
When running rteval, the system enters a soft-lockup-like state after a while.  One run took ~11 hours to reach that state, another ~2 hours.

The kernel still responds to ping, but the serial console locked up while being refreshed.  It managed to print "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" before going silent.  All SSH connections were dead.  sshd sort of accepted new connections (the client did not time out), but never responded to the SSH negotiation.  I don't recall for certain whether I boosted the sshd priority to SCHED_FIFO with a priority of 75, but I believe I did.
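
(For reference: boosting an already running sshd to SCHED_FIFO priority 75 can be done with chrt.  This is only a sketch of what that boost would look like, not a log of what was actually run on the machine.)

for pid in $(pidof sshd); do chrt -f -p 75 $pid; done   # set SCHED_FIFO, priority 75, on all sshd processes
chrt -p $(pidof -s sshd)                                # verify the scheduling policy and priority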

In the last test, I ran rteval in one shell and the following command in another shell, dumping the state of all tasks (sysrq-t) and of blocked tasks (sysrq-w) to a log every 15 minutes:

[root@hp-xw8400-01 ~]# (while /bin/true; do date; echo t > /proc/sysrq-trigger ; echo "###########"; dmesg; echo w > /proc/sysrq-trigger; echo "########### ***"; dmesg; sleep 900; done) > sysrq-log.txt

I've attached an extract of sysrq-log.txt containing the last dump, taken just a few minutes before the kernel locked up.
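
If this gets re-run, a variant that timestamps each dump and clears the kernel ring buffer between captures would avoid the duplicated dmesg output; a rough sketch (not the command actually used above):

(while /bin/true; do
    date
    echo t > /proc/sysrq-trigger    # dump the state of all tasks
    dmesg -c                        # print and clear the ring buffer
    echo "###########"
    echo w > /proc/sysrq-trigger    # dump blocked (uninterruptible) tasks
    dmesg -c
    echo "########### ***"
    sleep 900                       # repeat every 15 minutes
done) > sysrq-log.txt 2>&1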

There are no traces in the log files that indicate trouble, and no backtraces on the console.

Output from rteval:
------------------------------------------------------------------------
kcompile: ready to run
rteval run on 2.6.33.4-rt20.18.el5rt started at Wed May 26 15:44:22 2010
started 2 loads on 8 cores 
Run duration: 50400 seconds
starting cyclictest
cyclictest: running in SMP mode
cyclictest: starting with cmd: cyclictest -i100 -qm -d0 -h 2000 -p95 --smp
sending start event to all loads
waiting for duration (50400.000000)
kcompile: starting loop (jobs: 16)
hackbench: starting loop (jobs: 40)
rteval time remaining: 0 days, 13 hours, 49 minutes, 59 seconds
rteval time remaining: 0 days, 13 hours, 39 minutes, 58 seconds
rteval time remaining: 0 days, 13 hours, 29 minutes, 58 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 19 minutes, 57 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 9 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 59 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 49 minutes, 55 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 39 minutes, 54 seconds
rteval time remaining: 0 days, 12 hours, 29 minutes, 54 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 19 minutes, 53 seconds
rteval time remaining: 0 days, 12 hours, 9 minutes, 52 seconds
rteval time remaining: 0 days, 11 hours, 59 minutes, 51 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 11 hours, 49 minutes, 51 seconds
------------------------------------------------------------------------
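
For reference, the cyclictest command line started by rteval above breaks down as follows (my annotation, not rteval output):

# -i100  : 100 microsecond base measurement interval
# -q     : quiet, only print a summary on exit
# -m     : mlockall, lock current and future memory to prevent paging
# -d0    : zero distance between thread intervals, so all threads use the same interval
# -h 2000: dump a latency histogram covering 0-2000 microseconds
# -p95   : run the measurement threads at SCHED_FIFO priority 95
# --smp  : standard SMP testing, one measurement thread per CPU
cyclictest -i100 -qm -d0 -h 2000 -p95 --smp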

The last entry in /var/log/messages was logged at 17:09:20 (ntpd[7666]: synchronized to 10.16.71.254, stratum 2).  The sysrq polling loop wrote its last output at 17:58:15.

The /var/log/maillog file indicates that the system load was high, though not as high as it had been earlier:
------------------------------------------------------------------------
May 26 17:53:20 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 607
May 26 17:53:35 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 611
May 26 17:53:51 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 546
May 26 17:54:06 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 430
May 26 17:54:21 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 583
May 26 17:54:36 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 598
May 26 17:54:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 813
May 26 17:55:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 753
May 26 17:55:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 837
May 26 17:55:37 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 814
May 26 17:55:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 658
May 26 17:56:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 545
May 26 17:56:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 434
May 26 17:56:38 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 344
May 26 17:56:53 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 397
May 26 17:57:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 605
May 26 17:57:24 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 677
May 26 17:57:39 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 854
May 26 17:57:54 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 774
May 26 17:58:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 610
------------------------------------------------------------------------

As far as we know, crond did nothing interesting:
------------------------------------------------------------------------
May 26 15:30:16 hp-xw8400-01 anacron[7750]: Normal exit (0 jobs run)
May 26 16:01:03 hp-xw8400-01 crond[18957]: (root) CMD (run-parts /etc/cron.hourly)
May 26 17:01:02 hp-xw8400-01 crond[4840]: (root) CMD (run-parts /etc/cron.hourly)
------------------------------------------------------------------------

Comment 1 David Sommerseth 2010-05-27 13:13:24 UTC

*** This bug has been marked as a duplicate of bug 584153 ***