Bug 596705 - kernel-rt-2.6.33.4-rt20.18 soft-lock-up when running rteval
Summary: kernel-rt-2.6.33.4-rt20.18 soft-lock-up when running rteval
Keywords:
Status: CLOSED DUPLICATE of bug 584153
Alias: None
Product: Red Hat Enterprise MRG
Classification: Red Hat
Component: realtime-kernel
Version: Development
Hardware: x86_64
OS: Linux
Priority: low
Severity: high
Target Milestone: ---
Assignee: Red Hat Real Time Maintenance
QA Contact: David Sommerseth
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2010-05-27 10:51 UTC by David Sommerseth
Modified: 2016-05-22 23:30 UTC
CC List: 4 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2010-05-27 13:13:24 UTC
Target Upstream Version:
Embargoed:


Attachments
Extract of the timed sysrq polling (241.23 KB, text/plain)
2010-05-27 10:51 UTC, David Sommerseth

Description David Sommerseth 2010-05-27 10:51:07 UTC
Created attachment 417194
Extract of the timed sysrq polling

Description of problem:
When running rteval, the system enters a kind of soft-lock-up state after a while.  One run took ~11 hours to hit it, another ~2 hours.

The kernel responds to ping, but the serial console locked up while being refreshed.  It managed to print "Red Hat Enterprise Linux Server release 5.5 (Tikanga)" before going silent.  All SSH connections were dead.  sshd sort of accepted a new connection (the client did not time out), but it never responded with the SSH negotiation.  I don't recall now whether I had boosted the sshd priority to SCHED_FIFO with priority 75, but I believe I did.
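
If I did boost it, it would have been something along these lines (from memory, not necessarily the exact invocation I used):

[root@hp-xw8400-01 ~]# chrt -f -p 75 $(pidof -s sshd)    # SCHED_FIFO, priority 75, on the running sshd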

In the last test, I ran rteval in one shell and in another shell I ran this command:

[root@hp-xw8400-01 ~]# (while /bin/true; do date; echo t > /proc/sysrq-trigger ; echo "###########"; dmesg; echo w > /proc/sysrq-trigger; echo "########### ***"; dmesg; sleep 900; done) > sysrq-log.txt
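
The same loop broken out for readability (functionally equivalent to the one-liner above; sysrq 't' dumps the state of all tasks and sysrq 'w' dumps tasks stuck in uninterruptible sleep, both into the kernel log):

#!/bin/bash
(
  while /bin/true; do
      date
      echo t > /proc/sysrq-trigger    # dump state of all tasks to the kernel log
      echo "###########"
      dmesg
      echo w > /proc/sysrq-trigger    # dump tasks in uninterruptible (blocked) state
      echo "########### ***"
      dmesg
      sleep 900                       # poll every 15 minutes
  done
) > sysrq-log.txt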

I've attached an extract of sysrq-log.txt containing the last run, just a few minutes before the kernel locked up.

There are no traces in the log files indicating trouble, and no backtraces on the console.

Output from rteval:
------------------------------------------------------------------------
kcompile: ready to run
rteval run on 2.6.33.4-rt20.18.el5rt started at Wed May 26 15:44:22 2010
started 2 loads on 8 cores 
Run duration: 50400 seconds
starting cyclictest
cyclictest: running in SMP mode
cyclictest: starting with cmd: cyclictest -i100 -qm -d0 -h 2000 -p95 --smp
sending start event to all loads
waiting for duration (50400.000000)
kcompile: starting loop (jobs: 16)
hackbench: starting loop (jobs: 40)
rteval time remaining: 0 days, 13 hours, 49 minutes, 59 seconds
rteval time remaining: 0 days, 13 hours, 39 minutes, 58 seconds
rteval time remaining: 0 days, 13 hours, 29 minutes, 58 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 19 minutes, 57 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 13 hours, 9 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 59 minutes, 56 seconds
rteval time remaining: 0 days, 12 hours, 49 minutes, 55 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 39 minutes, 54 seconds
rteval time remaining: 0 days, 12 hours, 29 minutes, 54 seconds
kcompile: restarting compile job
rteval time remaining: 0 days, 12 hours, 19 minutes, 53 seconds
rteval time remaining: 0 days, 12 hours, 9 minutes, 52 seconds
rteval time remaining: 0 days, 11 hours, 59 minutes, 51 seconds
kcompile: restarting compile job
kcompile: restarting compile job
rteval time remaining: 0 days, 11 hours, 49 minutes, 51 seconds
------------------------------------------------------------------------

The last entry in /var/log/messages was logged at 17:09:20 (ntpd[7666]: synchronized to 10.16.71.254, stratum 2).  The sysrq log loop ran for the last time at 17:58:15.

The /var/log/maillog file indicates that the load on the system was high, but not as high as it had been earlier:
------------------------------------------------------------------------
May 26 17:53:20 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 607
May 26 17:53:35 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 611
May 26 17:53:51 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 546
May 26 17:54:06 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 430
May 26 17:54:21 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 583
May 26 17:54:36 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 598
May 26 17:54:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 813
May 26 17:55:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 753
May 26 17:55:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 837
May 26 17:55:37 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 814
May 26 17:55:52 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 658
May 26 17:56:07 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 545
May 26 17:56:22 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 434
May 26 17:56:38 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 344
May 26 17:56:53 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 397
May 26 17:57:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 605
May 26 17:57:24 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 677
May 26 17:57:39 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 854
May 26 17:57:54 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 774
May 26 17:58:09 hp-xw8400-01 sendmail[7684]: rejecting connections on daemon MTA: load average: 610
------------------------------------------------------------------------

crond did nothing interesting that we know of:
------------------------------------------------------------------------
May 26 15:30:16 hp-xw8400-01 anacron[7750]: Normal exit (0 jobs run)
May 26 16:01:03 hp-xw8400-01 crond[18957]: (root) CMD (run-parts /etc/cron.hourly)
May 26 17:01:02 hp-xw8400-01 crond[4840]: (root) CMD (run-parts /etc/cron.hourly)
------------------------------------------------------------------------

Comment 1 David Sommerseth 2010-05-27 13:13:24 UTC

*** This bug has been marked as a duplicate of bug 584153 ***

