Bug 584153
Summary: | RT tests fail due to watchdog timeout
---|---
Product: | Red Hat Enterprise MRG
Component: | realtime-kernel
Version: | Development
Status: | CLOSED NEXTRELEASE
Severity: | medium
Priority: | high
Reporter: | D. Marlin <dmarlin>
Assignee: | Luis Claudio R. Goncalves <lgoncalv>
QA Contact: | David Sommerseth <davids>
CC: | acme, bhu, davids, jeder, jkacur, lgoncalv, matt, ovasik, srostedt, tglx, williams
Target Milestone: | 1.3
Target Release: | ---
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2010-09-28 18:42:40 UTC
Comment 1
D. Marlin
2010-04-26 16:16:39 UTC
We have added more debugging info to the test, including the --debug option, which triggers a sysrq every hour of the test. In addition, Luis has patched rteval to send output to the console to assist in debugging.

We performed a test on ibm-ls21-7972-01.rhts.eng.bos.redhat.com with this new version, but it did not fail (no watchdog):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=152894

We have queued several additional jobs running this version in order to see if it is reproducible. Results are still coming in, but the following jobs aborted (watchdog):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153308
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153309
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153320

However, the following job completed successfully (no watchdog):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153318

More results later today.

*** Bug 596705 has been marked as a duplicate of this bug. ***

All recent manual rteval runs (8hr, 12hr, 18hr) in the farm (systems not in rhts) on the following kernels have completed:
rt23.23
rt23.22
rt22.21

June 21, 2010 overnight runs:
- 2.6.33.5-rt22.21.1.el5rt debug kernel
- rteval-1.22-1.el5rt, rt-tests-0.71-1.el5rt, rt-setup-1.7-2.el5rt, rtctl-1.8-2.el5rt
- Red Hat Enterprise Linux Server release 5.5 (Tikanga)
- additional debug:
echo 9 > /proc/sysrq-trigger
echo 1 > /proc/sys/kernel/panic_on_io_nmi
echo 1 > /proc/sys/kernel/panic_on_oops
echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
echo 1 > /proc/sys/kernel/print-fatal-signals
echo 1 > /proc/sys/kernel/unknown_nmi_panic
echo 900 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo 60 > /proc/sys/kernel/softlockup_thresh
echo 1 > /proc/sys/kernel/softlockup_panic

Tests:
- hang on a 16hr rteval run on dell-pem610-01. nothing on the ssh session. info from the console: http://pastebin.test.redhat.com/25348
  note: scrollback was not set high enough, so that's all the info on the console.
- tyan-gt24-08 hung on a 16hr run. nothing on the console. here's all I could see from the ssh session: http://pastebin.test.redhat.com/25349
- a second Tyan system, tyan-gt24-05, hung on a 16hr run with 9hr 19min to go, but there was nothing on the console and nothing on ssh
- the IBM ls42 and hs22 systems completed their 15hr runs with no problems

Created attachment 426589 [details]
trace from rteval hang on the tyan-gt24-05
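The per-knob echo commands used in these debug runs can be collected into one script. This is a sketch, not part of the original test setup: the DRY_RUN guard is an addition for safe preview, actually writing the values requires root, and whether each knob exists depends on the kernel build.

```shell
#!/bin/sh
# Sketch: apply the Jun 21 hang-debug settings in one place.
# DRY_RUN defaults to on, so the script only prints what it would do;
# clear it (DRY_RUN=) and run as root to actually write the knobs.
DRY_RUN=${DRY_RUN:-1}

set_knob() {
    # $1 = value, $2 = /proc path
    if [ -n "$DRY_RUN" ]; then
        echo "would write $1 -> $2"
    else
        echo "$1" > "$2"
    fi
}

apply_debug_settings() {
    set_knob 1   /proc/sys/kernel/panic_on_io_nmi
    set_knob 1   /proc/sys/kernel/panic_on_oops
    set_knob 1   /proc/sys/kernel/panic_on_unrecovered_nmi
    set_knob 1   /proc/sys/kernel/print-fatal-signals
    set_knob 1   /proc/sys/kernel/unknown_nmi_panic
    set_knob 900 /proc/sys/kernel/hung_task_timeout_secs
    set_knob 0   /proc/sys/kernel/hung_task_panic
    set_knob 60  /proc/sys/kernel/softlockup_thresh
    set_knob 1   /proc/sys/kernel/softlockup_panic
}

apply_debug_settings
```

With hung_task_panic left at 0 the hung-task detector only logs blocked tasks; the later runs flip individual knobs between 0 and 1 to control which conditions panic the box.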
June 23, 2010 testing using the 2.6.33.5-rt23.23.1.el5rt debug kernel on the tyan-gt24-05.rhts.eng.bos.redhat.com
> rteval --xmlrpc-submit=rtserver.farm.hsv.redhat.com --workdir=./bhu-runs --duration=15h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Wed Jun 23 17:48:19 2010
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 5 hours, 19 minutes, 30 seconds
Nothing else was printed in the ssh window. See tyan-gt24-05-2010jun23 for the trace from the console.
Created attachment 426899 [details]
console from hung dell-per905-01
Created attachment 426900 [details]
console from hung ibm-hs22-01
Created attachment 426901 [details]
console from hung tyan-gt24-08
June 24, 2010 testing

Configuration:
- 2.6.33.5-rt23.23.1.el5rt debug kernel
- 'nmi_watchdog=panic,1' added to the kernel boot line
- proc settings:
echo 9 > /proc/sysrq-trigger
echo 0 > /proc/sys/kernel/panic_on_io_nmi
echo 0 > /proc/sys/kernel/panic_on_oops
echo 0 > /proc/sys/kernel/panic_on_unrecovered_nmi
echo 1 > /proc/sys/kernel/print-fatal-signals
echo 0 > /proc/sys/kernel/unknown_nmi_panic
echo 900 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo 60 > /proc/sys/kernel/softlockup_thresh
echo 0 > /proc/sys/kernel/softlockup_panic

tyan-gt24-08.rhts.eng.bos.redhat.com
-----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:22:57 2010
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 15 hours, 39 minutes, 52 seconds
can't ssh back into it. console appears hung.
console output: https://bugzilla.redhat.com/attachment.cgi?id=426901

ibm-hs22-01.lab.bos.redhat.com
--------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:06:48 2010
started 2 loads on 4 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 9 minutes, 55 seconds
can't ssh back into it. console appears hung.
console output: https://bugzilla.redhat.com/attachment.cgi?id=426900

dell-per905-01.rhts.eng.bos.redhat.com
----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:01:14 2010
started 2 loads on 16 cores with 4 numa nodes
...
rteval time remaining: 0 days, 15 hours, 40 minutes, 12 seconds
can't ssh back in. console appears hung.
console output: https://bugzilla.redhat.com/attachment.cgi?id=426899

Created attachment 426964 [details]
console from hung dell-pem610-01
dell-pem610-01.rhts.eng.bos.redhat.com
---------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 15:57:42 2010
started 2 loads on 16 cores with 2 numa nodes
...
rteval time remaining: 0 days, 0 hours, 5 minutes, 54 seconds
can't ssh into it. ping still works. ssh window is not responding.
output is still going to the console.
Created attachment 427439 [details]
consoles for Jun25 hangs on dell-pem610-01, tyan-gt24-08, and ibm-hs22-01
June 25, 2010 testing using the 2.6.33.5-rt23.23.1.el5rt debug kernel
Configuration:
- 'nmi_watchdog=panic,1' added to the kernel boot line
- proc settings
echo 9 > /proc/sysrq-trigger
echo 1 > /proc/sys/kernel/print-fatal-signals
echo 900 > /proc/sys/kernel/hung_task_timeout_secs
dell-pem610-01.rhts.eng.bos.redhat.com
---------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:29:32 2010
started 2 loads on 16 cores with 2 numa nodes
...
rteval time remaining: 0 days, 12 hours, 15 minutes, 42 seconds
hung. can't ssh into the system, console not responding.
see jun25-dell-pem610-01-console
tyan-gt24-08.rhts.eng.bos.redhat.com
-----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:31:09 2010
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 19 minutes, 54 seconds
hung. can't ssh into the system, console not responding.
see jun24-tyan-gt24-08-console
ibm-hs22-01.lab.bos.redhat.com
--------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:34:58 2010
started 2 loads on 4 cores with 2 numa nodes
...
rteval time remaining: 0 days, 17 hours, 49 minutes, 59 seconds @ 7:17
hung. can't ssh into the system, console not responding.
see jun25-ibm-hs22-01-console
I created the following with an upstream -rt kernel running:

sudo ./cyclictest -t32 -p99
while :; do ./hackbench 20 ; done
make O=/bld/tmp -j19 (kernel compile)

INFO: task kswapd0:547 blocked for more than 900 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0 D 00000000ffffffff 0 547 2 0x00000008
 ffff88018d033950 0000000000000046 ffff88018d033918 0000000000000000
 ffff880100000000 000000000000af98 ffff8801a0131470 00000001014f8772
 ffff8800973902c0 ffff8801a01310c0 00000000000001f5 0000000000000000
Call Trace:
 [<ffffffff81511c2f>] schedule+0x20/0x52
 [<ffffffff81512fbc>] rt_spin_lock_slowlock+0x236/0x324
 [<ffffffff810af7ec>] rt_spin_lock_fastlock.clone.1+0xcf/0xd4
 [<ffffffff81513e25>] rt_spin_lock+0x16/0x3d
 [<ffffffff811953e9>] generic_detach_inode+0x69/0x29c
 [<ffffffff811971a7>] generic_drop_inode+0x39/0x9e
 [<ffffffff81194dfb>] iput+0xd7/0xe6
 [<ffffffff8118fa1e>] dentry_iput+0x11f/0x12b
 [<ffffffff8118fc37>] d_kill+0xa9/0xb9
 [<ffffffff81190136>] __shrink_dcache_sb+0x4ef/0x771
 [<ffffffff81190538>] ? shrink_dcache_memory+0x155/0x271
 [<ffffffff811905aa>] shrink_dcache_memory+0x1c7/0x271
 [<ffffffff811301a2>] ? shrink_slab+0x53/0x208
 [<ffffffff811302a0>] shrink_slab+0x151/0x208
 [<ffffffff81130bbe>] balance_pgdat+0x4a0/0x785
 [<ffffffff8112cfe2>] ? isolate_pages_global+0x0/0x324
 [<ffffffff81513ec8>] ? rt_spin_unlock+0x2b/0x30
 [<ffffffff81131182>] kswapd+0x2df/0x2f5
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81044c0e>] ? need_resched+0x3f/0x46
 [<ffffffff81130ea3>] ? kswapd+0x0/0x2f5
 [<ffffffff8108beaa>] kthread+0xa4/0xac
 [<ffffffff810045d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81514f80>] ? restore_args+0x0/0x30
 [<ffffffff8108be06>] ? kthread+0x0/0xac
 [<ffffffff810045d0>] ? kernel_thread_helper+0x0/0x10
INFO: lockdep is turned off.
Kernel panic - not syncing: hung_task: blocked tasks
Dumping ftrace buffer:
   (ftrace buffer empty)

INFO: task jbd2/md127p5-8:1737 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md127p5- D 00000000ffffffff 0 1737 2 0x00000000
 ffff88018c12da40 0000000000000046 ffff88018c12d9f0 ffffffff815147d4
 ffffffff810aed78 00000000955a4740 ffff88018b3ec4f0 ffff88018c54fa70
 ffff8800b63c5640 ffff88018b3ec140 ffff880100000000 ffffffff8104f1f7
Call Trace:
 [<ffffffff815147d4>] ? _raw_spin_lock_irq+0x89/0x90
 [<ffffffff810aed78>] ? task_blocks_on_rt_mutex+0x1d7/0x1e9
 [<ffffffff8104f1f7>] ? get_parent_ip+0x11/0x6f
 [<ffffffff81511c2f>] schedule+0x20/0x52
 [<ffffffff81512fbc>] rt_spin_lock_slowlock+0x236/0x324
 [<ffffffff810e9fab>] ? __rcu_read_unlock+0x57/0x318
 [<ffffffff810af7ec>] rt_spin_lock_fastlock.clone.1+0xcf/0xd4
 [<ffffffff81513e25>] rt_spin_lock+0x16/0x3d
 [<ffffffff811a4a34>] __mark_inode_dirty+0xdc/0x29c
 [<ffffffff811adce1>] __set_page_dirty+0x17d/0x18a
 [<ffffffff811ade60>] mark_buffer_dirty+0x172/0x17a
 [<ffffffffa01d75d3>] __jbd2_journal_temp_unlink_buffer+0x202/0x20b [jbd2]
 [<ffffffffa01d75f2>] __jbd2_journal_unfile_buffer+0x16/0x22 [jbd2]
 [<ffffffffa01d7a21>] __jbd2_journal_refile_buffer+0xa4/0x13c [jbd2]
 [<ffffffffa01dc27d>] jbd2_journal_commit_transaction+0x16fb/0x1d5a [jbd2]
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81075f31>] ? del_timer_sync+0x0/0xce
 [<ffffffffa01e3ae3>] kjournald2+0x10b/0x311 [jbd2]
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81044c0e>] ? need_resched+0x3f/0x46
 [<ffffffffa01e39d8>] ? kjournald2+0x0/0x311 [jbd2]
 [<ffffffff8108beaa>] kthread+0xa4/0xac
 [<ffffffff810045d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81514f80>] ? restore_args+0x0/0x30
 [<ffffffff8108be06>] ? kthread+0x0/0xac
 [<ffffffff810045d0>] ? kernel_thread_helper+0x0/0x10
INFO: lockdep is turned off.
Kernel panic - not syncing: hung_task: blocked tasks

Created attachment 428008 [details]
consoles for Jun29 hangs on pem610, tyan-gt24-08, hs22, ls42, and per905
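The upstream reproducer described earlier (a high-priority cyclictest, a continuous hackbench loop, and a parallel kernel compile) can be sketched as a single launcher. cyclictest and hackbench ship in rt-tests; the KSRC kernel-source path and the dry-run wrapper are assumptions of this sketch, not part of the original runs.

```shell
#!/bin/sh
# Sketch of the load mix that reproduced the hung_task panic upstream.
# DRY_RUN defaults to on and only prints the commands; clear it to
# actually start the loads in the background.
DRY_RUN=${DRY_RUN:-1}
KSRC=${KSRC:-/usr/src/linux}   # hypothetical kernel source tree

run() {
    if [ -n "$DRY_RUN" ]; then
        echo "would run: $*"
    else
        "$@" &
    fi
}

run cyclictest -t32 -p99                      # 32 measurement threads, RT prio 99
run sh -c 'while :; do hackbench 20; done'    # continuous scheduler/IPC churn
run make -C "$KSRC" O=/bld/tmp -j19           # parallel compile for VFS/memory pressure
```

The combination matters: the hackbench loop and compile push the dcache/inode shrinkers and jbd2 commits that show up in the two blocked-task traces, while cyclictest keeps RT threads contending for the same rt_spin_locks.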
June 29, 2010 testing using rostedt's 2.6.33.5-rt23.23.1.el5rt debug kernel built on June 28
Settings:
- 'nmi_watchdog=1' and 'earlyprintk=xxx' added to the kernel boot line
- proc settings
echo 9 > /proc/sysrq-trigger
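With nmi_watchdog=1 on the boot line, one quick way to confirm the watchdog is actually ticking is to sample the NMI counts in /proc/interrupts twice. This check is a sketch added here, not something from the original runs; the 2-second interval is arbitrary.

```shell
#!/bin/sh
# Sketch: verify the NMI watchdog is generating interrupts by comparing
# the summed per-CPU NMI counts over a short interval.

nmi_total() {
    # sum the NMI row across all CPUs; prints 0 if the file is unreadable
    cat /proc/interrupts 2>/dev/null |
        awk '/NMI:/ { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i }
             END { print s + 0 }'
}

before=$(nmi_total)
sleep 2
after=$(nmi_total)
if [ "$after" -gt "$before" ]; then
    echo "NMI watchdog appears active ($before -> $after)"
else
    echo "no NMI ticks in 2s; the watchdog may be off"
fi
```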
dell-pem610-01.rhts.eng.bos.redhat.com
---------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:54:50 2010 (Eastern)
started 2 loads on 16 cores with 2 numa nodes
...
rteval time remaining: 0 days, 10 hours, 39 minutes, 34 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 22:41 screenlog.2 (Central)
see jun29-dell-pem610-01-console
tyan-gt24-08.rhts.eng.bos.redhat.com
-----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:37:46 2010 (Eastern)
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 9 minutes, 54 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 16:57 screenlog.4 (Central)
see jun29-tyan-gt24-08-console
ibm-ls42-01.lab.bos.redhat.com
-----------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:57:17 2010 (Eastern)
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 19 minutes, 56 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 17:33 screenlog.6 (Central)
see jun29-ibm-ls42-01-console
ibm-hs22-01.lab.bos.redhat.com
--------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 16:12:19 2010 (Eastern)
started 2 loads on 4 cores with 2 numa nodes
...
rteval time remaining: 0 days, 4 hours, 55 minutes, 30 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 30 04:28 screenlog.8 (Central)
see jun29-ibm-hs22-01-console
dell-per905-01.rhts.eng.bos.redhat.com
----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:12:23 2010 (Eastern)
started 2 loads on 16 cores with 4 numa nodes
...
rteval time remaining: 0 days, 17 hours, 9 minutes, 57 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 16:30 screenlog.10 (Central)
see jun29-dell-per905-01-console
The console ends with this:
.bkl_count : 1017
cfs_rq[0]:
.exec_clock : 6712126.752263

You hit the bug in the NMI dump code that I just fixed. The sched_debug output that prints this grabs the rq spinlock, and if the NMI preempted a place where the rq spinlock was already taken, it would lock up the box. Thus, this is just a lockup due to the debug output, not the lockup we are looking for. I'm currently building a new kernel with the debug fix.

I set up 18-hour runs of the 2.6.33.5-rt23.23.4.el5rt kernel on:
dell-pem610-01
dell-per905-01
tyan-gt24-08
ibm-ls42-01
ibm-hs22-01

With the exception of ibm-ls42-01, all runs completed. Apparently ibm-ls42-01 disconnected the console and ssh session about 8 hours into the test. I'm not sure if the system reset or something else caused the connection loss. The log is in the farm:
rhel5.farm.hsv.redhat.com:/es/scratch/dmarlin/RT/screenlog.4.20100603

Dropping MRG 1.3 Blocker status.

We'll tell customers that if they want to run a system with more than 4GB of RAM, they need to run a 64-bit kernel.

This looks like an issue with HIGHMEM/PAE being used on very large memory servers. We've decided to ship our 32-bit kernels without PAE support, limiting them to 4GB of RAM.

Closing
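The 4GB guidance above can be turned into a trivial check: compare MemTotal against the 32-bit non-PAE addressing limit. A sketch, assuming a Linux /proc/meminfo; the threshold is the 4GB ceiling discussed in the closing comments.

```shell
#!/bin/sh
# Sketch: flag machines whose RAM exceeds what a 32-bit non-PAE kernel
# can address (4 GB = 4194304 kB). Advisory only; changes nothing.

needs_64bit_kernel() {
    # $1 = MemTotal in kB
    [ "$1" -gt 4194304 ]
}

memkb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
memkb=${memkb:-0}
if needs_64bit_kernel "$memkb"; then
    echo "MemTotal ${memkb} kB exceeds 4 GB: run the x86_64 kernel"
else
    echo "MemTotal ${memkb} kB fits a 32-bit non-PAE kernel"
fi
```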