Bug 584153
Summary: | RT tests fail due to watchdog timeout
---|---
Product: | Red Hat Enterprise MRG
Component: | realtime-kernel
Version: | Development
Status: | CLOSED NEXTRELEASE
Severity: | medium
Priority: | high
Reporter: | D. Marlin <dmarlin>
Assignee: | Luis Claudio R. Goncalves <lgoncalv>
QA Contact: | David Sommerseth <davids>
CC: | acme, bhu, davids, jeder, jkacur, lgoncalv, matt, ovasik, srostedt, tglx, williams
Target Milestone: | 1.3
Target Release: | ---
Hardware: | All
OS: | Linux
Doc Type: | Bug Fix
Last Closed: | 2010-09-28 18:42:40 UTC
Comment 1
D. Marlin
2010-04-26 16:16:39 UTC
We have added more debugging info to the test, including the --debug option, which triggers a sysrq every hour of the test. In addition, Luis has patched rteval to send output to the console to assist in debugging.

We performed a test on ibm-ls21-7972-01.rhts.eng.bos.redhat.com with this new version, but it did not fail (no watchdog):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=152894

We have queued several additional jobs running this version in order to see if it is reproducible. Results are still coming in, but the following jobs aborted (watchdog):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153308
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153309
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153320

However, the following job completed successfully (no watchdog):
http://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=153318

More results later today.

*** Bug 596705 has been marked as a duplicate of this bug. ***

All recent manual rteval runs (8hr, 12hr, 18hr) in the farm (systems not in rhts) on the following kernels have completed:
rt23.23
rt23.22
rt22.21

June 21, 2010 overnight runs:
- 2.6.33.5-rt22.21.1.el5rt debug kernel
- rteval-1.22-1.el5rt, rt-tests-0.71-1.el5rt, rt-setup-1.7-2.el5rt, rtctl-1.8-2.el5rt
- Red Hat Enterprise Linux Server release 5.5 (Tikanga)
- additional debug:
echo 9 > /proc/sysrq-trigger
echo 1 > /proc/sys/kernel/panic_on_io_nmi
echo 1 > /proc/sys/kernel/panic_on_oops
echo 1 > /proc/sys/kernel/panic_on_unrecovered_nmi
echo 1 > /proc/sys/kernel/print-fatal-signals
echo 1 > /proc/sys/kernel/unknown_nmi_panic
echo 900 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo 60 > /proc/sys/kernel/softlockup_thresh
echo 1 > /proc/sys/kernel/softlockup_panic

Tests:
- hang on a 16hr rteval run on dell-pem610-01. nothing on the ssh session. info from the console: http://pastebin.test.redhat.com/25348
  note: scrollback was not set high enough, so that's all the info on the console.
- tyan-gt24-08 hung on a 16hr run. nothing on the console. here's all I could see from the ssh session: http://pastebin.test.redhat.com/25349
- a second Tyan system, tyan-gt24-05, hung on a 16hr run with 9hr 19min to go, but there was nothing on the console and nothing on ssh
- the IBM ls42 and hs22 systems completed their 15hr runs with no problems

Created attachment 426589 [details]
trace from rteval hang on the tyan-gt24-05
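The per-knob echo commands used in these debug runs can be collected into one script. This is a sketch, not part of the original test setup: the DRY_RUN guard is an addition for safe preview, actually writing the values requires root, and whether each knob exists depends on the kernel build.

```shell
#!/bin/sh
# Sketch: apply the Jun 21 hang-debug settings in one place.
# DRY_RUN defaults to on, so the script only prints what it would do;
# clear it (DRY_RUN=) and run as root to actually write the knobs.
DRY_RUN=${DRY_RUN:-1}

set_knob() {
    # $1 = value, $2 = /proc path
    if [ -n "$DRY_RUN" ]; then
        echo "would write $1 -> $2"
    else
        echo "$1" > "$2"
    fi
}

apply_debug_settings() {
    set_knob 1   /proc/sys/kernel/panic_on_io_nmi
    set_knob 1   /proc/sys/kernel/panic_on_oops
    set_knob 1   /proc/sys/kernel/panic_on_unrecovered_nmi
    set_knob 1   /proc/sys/kernel/print-fatal-signals
    set_knob 1   /proc/sys/kernel/unknown_nmi_panic
    set_knob 900 /proc/sys/kernel/hung_task_timeout_secs
    set_knob 0   /proc/sys/kernel/hung_task_panic
    set_knob 60  /proc/sys/kernel/softlockup_thresh
    set_knob 1   /proc/sys/kernel/softlockup_panic
}

apply_debug_settings
```

With hung_task_panic left at 0 the hung-task detector only logs blocked tasks; the later runs flip individual knobs between 0 and 1 to control which conditions panic the box.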
June 23, 2010 testing using the 2.6.33.5-rt23.23.1.el5rt debug kernel on the tyan-gt24-05.rhts.eng.bos.redhat.com
> rteval --xmlrpc-submit=rtserver.farm.hsv.redhat.com --workdir=./bhu-runs --duration=15h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Wed Jun 23 17:48:19 2010
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 5 hours, 19 minutes, 30 seconds
Nothing else was printed in the ssh window. See tyan-gt24-05-2010jun23 for the trace from the console.
Created attachment 426899 [details]
console from hung dell-per905-01
Created attachment 426900 [details]
console from hung ibm-hs22-01
Created attachment 426901 [details]
console from hung tyan-gt24-08
June 24, 2010 testing

Configuration:
- 2.6.33.5-rt23.23.1.el5rt debug kernel
- 'nmi_watchdog=panic,1' added to the kernel boot line
- proc settings:
echo 9 > /proc/sysrq-trigger
echo 0 > /proc/sys/kernel/panic_on_io_nmi
echo 0 > /proc/sys/kernel/panic_on_oops
echo 0 > /proc/sys/kernel/panic_on_unrecovered_nmi
echo 1 > /proc/sys/kernel/print-fatal-signals
echo 0 > /proc/sys/kernel/unknown_nmi_panic
echo 900 > /proc/sys/kernel/hung_task_timeout_secs
echo 0 > /proc/sys/kernel/hung_task_panic
echo 60 > /proc/sys/kernel/softlockup_thresh
echo 0 > /proc/sys/kernel/softlockup_panic

tyan-gt24-08.rhts.eng.bos.redhat.com
-----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:22:57 2010
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 15 hours, 39 minutes, 52 seconds
can't ssh back into it. console appears hung.
console output: https://bugzilla.redhat.com/attachment.cgi?id=426901

ibm-hs22-01.lab.bos.redhat.com
--------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:06:48 2010
started 2 loads on 4 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 9 minutes, 55 seconds
can't ssh back into it. console appears hung.
console output: https://bugzilla.redhat.com/attachment.cgi?id=426900

dell-per905-01.rhts.eng.bos.redhat.com
----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 16:01:14 2010
started 2 loads on 16 cores with 4 numa nodes
...
rteval time remaining: 0 days, 15 hours, 40 minutes, 12 seconds
can't ssh back in. console appears hung.
console output: https://bugzilla.redhat.com/attachment.cgi?id=426899

Created attachment 426964 [details]
console from hung dell-pem610-01
dell-pem610-01.rhts.eng.bos.redhat.com
---------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Thu Jun 24 15:57:42 2010
started 2 loads on 16 cores with 2 numa nodes
...
rteval time remaining: 0 days, 0 hours, 5 minutes, 54 seconds
can't ssh into it. ping still works. ssh window is not responding.
output is still going to the console.
Created attachment 427439 [details]
consoles for Jun25 hangs on dell-pem610-01, tyan-gt24-08, and ibm-hs22-01
June 25, 2010 testing using the 2.6.33.5-rt23.23.1.el5rt debug kernel
Configuration:
- 'nmi_watchdog=panic,1' added to the kernel boot line
- proc settings
echo 9 > /proc/sysrq-trigger
echo 1 > /proc/sys/kernel/print-fatal-signals
echo 900 > /proc/sys/kernel/hung_task_timeout_secs
dell-pem610-01.rhts.eng.bos.redhat.com
---------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:29:32 2010
started 2 loads on 16 cores with 2 numa nodes
...
rteval time remaining: 0 days, 12 hours, 15 minutes, 42 seconds
hung. can't ssh into the system, console not responding.
see jun25-dell-pem610-01-console
tyan-gt24-08.rhts.eng.bos.redhat.com
-----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:31:09 2010
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 19 minutes, 54 seconds
hung. can't ssh into the system, console not responding.
see jun24-tyan-gt24-08-console
ibm-hs22-01.lab.bos.redhat.com
--------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Fri Jun 25 18:34:58 2010
started 2 loads on 4 cores with 2 numa nodes
...
rteval time remaining: 0 days, 17 hours, 49 minutes, 59 seconds @ 7:17
hung. can't ssh into the system, console not responding.
see jun25-ibm-hs22-01-console
I created the following with an upstream -rt kernel running:

sudo ./cyclictest -t32 -p99
while :; do ./hackbench 20 ; done
make O=/bld/tmp -j19 (kernel compile)

INFO: task kswapd0:547 blocked for more than 900 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
kswapd0 D 00000000ffffffff 0 547 2 0x00000008
 ffff88018d033950 0000000000000046 ffff88018d033918 0000000000000000
 ffff880100000000 000000000000af98 ffff8801a0131470 00000001014f8772
 ffff8800973902c0 ffff8801a01310c0 00000000000001f5 0000000000000000
Call Trace:
 [<ffffffff81511c2f>] schedule+0x20/0x52
 [<ffffffff81512fbc>] rt_spin_lock_slowlock+0x236/0x324
 [<ffffffff810af7ec>] rt_spin_lock_fastlock.clone.1+0xcf/0xd4
 [<ffffffff81513e25>] rt_spin_lock+0x16/0x3d
 [<ffffffff811953e9>] generic_detach_inode+0x69/0x29c
 [<ffffffff811971a7>] generic_drop_inode+0x39/0x9e
 [<ffffffff81194dfb>] iput+0xd7/0xe6
 [<ffffffff8118fa1e>] dentry_iput+0x11f/0x12b
 [<ffffffff8118fc37>] d_kill+0xa9/0xb9
 [<ffffffff81190136>] __shrink_dcache_sb+0x4ef/0x771
 [<ffffffff81190538>] ? shrink_dcache_memory+0x155/0x271
 [<ffffffff811905aa>] shrink_dcache_memory+0x1c7/0x271
 [<ffffffff811301a2>] ? shrink_slab+0x53/0x208
 [<ffffffff811302a0>] shrink_slab+0x151/0x208
 [<ffffffff81130bbe>] balance_pgdat+0x4a0/0x785
 [<ffffffff8112cfe2>] ? isolate_pages_global+0x0/0x324
 [<ffffffff81513ec8>] ? rt_spin_unlock+0x2b/0x30
 [<ffffffff81131182>] kswapd+0x2df/0x2f5
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81044c0e>] ? need_resched+0x3f/0x46
 [<ffffffff81130ea3>] ? kswapd+0x0/0x2f5
 [<ffffffff8108beaa>] kthread+0xa4/0xac
 [<ffffffff810045d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81514f80>] ? restore_args+0x0/0x30
 [<ffffffff8108be06>] ? kthread+0x0/0xac
 [<ffffffff810045d0>] ? kernel_thread_helper+0x0/0x10
INFO: lockdep is turned off.
Kernel panic - not syncing: hung_task: blocked tasks
Dumping ftrace buffer:
   (ftrace buffer empty)

INFO: task jbd2/md127p5-8:1737 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
jbd2/md127p5- D 00000000ffffffff 0 1737 2 0x00000000
 ffff88018c12da40 0000000000000046 ffff88018c12d9f0 ffffffff815147d4
 ffffffff810aed78 00000000955a4740 ffff88018b3ec4f0 ffff88018c54fa70
 ffff8800b63c5640 ffff88018b3ec140 ffff880100000000 ffffffff8104f1f7
Call Trace:
 [<ffffffff815147d4>] ? _raw_spin_lock_irq+0x89/0x90
 [<ffffffff810aed78>] ? task_blocks_on_rt_mutex+0x1d7/0x1e9
 [<ffffffff8104f1f7>] ? get_parent_ip+0x11/0x6f
 [<ffffffff81511c2f>] schedule+0x20/0x52
 [<ffffffff81512fbc>] rt_spin_lock_slowlock+0x236/0x324
 [<ffffffff810e9fab>] ? __rcu_read_unlock+0x57/0x318
 [<ffffffff810af7ec>] rt_spin_lock_fastlock.clone.1+0xcf/0xd4
 [<ffffffff81513e25>] rt_spin_lock+0x16/0x3d
 [<ffffffff811a4a34>] __mark_inode_dirty+0xdc/0x29c
 [<ffffffff811adce1>] __set_page_dirty+0x17d/0x18a
 [<ffffffff811ade60>] mark_buffer_dirty+0x172/0x17a
 [<ffffffffa01d75d3>] __jbd2_journal_temp_unlink_buffer+0x202/0x20b [jbd2]
 [<ffffffffa01d75f2>] __jbd2_journal_unfile_buffer+0x16/0x22 [jbd2]
 [<ffffffffa01d7a21>] __jbd2_journal_refile_buffer+0xa4/0x13c [jbd2]
 [<ffffffffa01dc27d>] jbd2_journal_commit_transaction+0x16fb/0x1d5a [jbd2]
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81075f31>] ? del_timer_sync+0x0/0xce
 [<ffffffffa01e3ae3>] kjournald2+0x10b/0x311 [jbd2]
 [<ffffffff8108c39a>] ? autoremove_wake_function+0x0/0x4f
 [<ffffffff81044c0e>] ? need_resched+0x3f/0x46
 [<ffffffffa01e39d8>] ? kjournald2+0x0/0x311 [jbd2]
 [<ffffffff8108beaa>] kthread+0xa4/0xac
 [<ffffffff810045d4>] kernel_thread_helper+0x4/0x10
 [<ffffffff81514f80>] ? restore_args+0x0/0x30
 [<ffffffff8108be06>] ? kthread+0x0/0xac
 [<ffffffff810045d0>] ? kernel_thread_helper+0x0/0x10
INFO: lockdep is turned off.
Kernel panic - not syncing: hung_task: blocked tasks

Created attachment 428008 [details]
consoles for Jun29 hangs on pem610, tyan-gt24-08, hs22, ls42, and per905
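The upstream reproducer described earlier (a high-priority cyclictest, a continuous hackbench loop, and a parallel kernel compile) can be sketched as a single launcher. cyclictest and hackbench ship in rt-tests; the KSRC kernel-source path and the dry-run wrapper are assumptions of this sketch, not part of the original runs.

```shell
#!/bin/sh
# Sketch of the load mix that reproduced the hung_task panic upstream.
# DRY_RUN defaults to on and only prints the commands; clear it to
# actually start the loads in the background.
DRY_RUN=${DRY_RUN:-1}
KSRC=${KSRC:-/usr/src/linux}   # hypothetical kernel source tree

run() {
    if [ -n "$DRY_RUN" ]; then
        echo "would run: $*"
    else
        "$@" &
    fi
}

run cyclictest -t32 -p99                      # 32 measurement threads, RT prio 99
run sh -c 'while :; do hackbench 20; done'    # continuous scheduler/IPC churn
run make -C "$KSRC" O=/bld/tmp -j19           # parallel compile for VFS/memory pressure
```

The combination matters: the hackbench loop and compile push the dcache/inode shrinkers and jbd2 commits that show up in the two blocked-task traces, while cyclictest keeps RT threads contending for the same rt_spin_locks.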
June 29, 2010 testing using rostedt's 2.6.33.5-rt23.23.1.el5rt debug kernel built on June 28
Settings:
- 'nmi_watchdog=1' and 'earlyprintk=xxx' added to the kernel boot line
- proc settings
echo 9 > /proc/sysrq-trigger
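With nmi_watchdog=1 on the boot line, one quick way to confirm the watchdog is actually ticking is to sample the NMI counts in /proc/interrupts twice. This check is a sketch added here, not something from the original runs; the 2-second interval is arbitrary.

```shell
#!/bin/sh
# Sketch: verify the NMI watchdog is generating interrupts by comparing
# the summed per-CPU NMI counts over a short interval.

nmi_total() {
    # sum the NMI row across all CPUs; prints 0 if the file is unreadable
    cat /proc/interrupts 2>/dev/null |
        awk '/NMI:/ { for (i = 2; i <= NF; i++) if ($i ~ /^[0-9]+$/) s += $i }
             END { print s + 0 }'
}

before=$(nmi_total)
sleep 2
after=$(nmi_total)
if [ "$after" -gt "$before" ]; then
    echo "NMI watchdog appears active ($before -> $after)"
else
    echo "no NMI ticks in 2s; the watchdog may be off"
fi
```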
dell-pem610-01.rhts.eng.bos.redhat.com
---------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:54:50 2010 (Eastern)
started 2 loads on 16 cores with 2 numa nodes
...
rteval time remaining: 0 days, 10 hours, 39 minutes, 34 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 22:41 screenlog.2 (Central)
see jun29-dell-pem610-01-console
tyan-gt24-08.rhts.eng.bos.redhat.com
-----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:37:46 2010 (Eastern)
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 9 minutes, 54 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 16:57 screenlog.4 (Central)
see jun29-tyan-gt24-08-console
ibm-ls42-01.lab.bos.redhat.com
-----------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:57:17 2010 (Eastern)
started 2 loads on 8 cores with 2 numa nodes
...
rteval time remaining: 0 days, 16 hours, 19 minutes, 56 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 17:33 screenlog.6 (Central)
see jun29-ibm-ls42-01-console
ibm-hs22-01.lab.bos.redhat.com
--------
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 16:12:19 2010 (Eastern)
started 2 loads on 4 cores with 2 numa nodes
...
rteval time remaining: 0 days, 4 hours, 55 minutes, 30 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 30 04:28 screenlog.8 (Central)
see jun29-ibm-hs22-01-console
dell-per905-01.rhts.eng.bos.redhat.com
----
rteval --workdir=./bhu-runs --duration=18h
rteval run on 2.6.33.5-rt23.23.1.el5rt started at Tue Jun 29 15:12:23 2010 (Eastern)
started 2 loads on 16 cores with 4 numa nodes
...
rteval time remaining: 0 days, 17 hours, 9 minutes, 57 seconds
hung. can't ssh into the system, console not responding.
last update to the console log: Jun 29 16:30 screenlog.10 (Central)
see jun29-dell-per905-01-console
The console ends with this:
.bkl_count : 1017
cfs_rq[0]:
.exec_clock : 6712126.752263

You hit the bug in the NMI dump code that I just fixed. The sched_debug output that prints this grabs the rq spinlock, and if the NMI preempted a place where the rq spinlock was already taken, it would lock up the box. Thus, this is just a lockup due to the debug output, not the lockup we are looking for. I'm currently building a new kernel with the debug fix.

I set up 18-hour runs of the 2.6.33.5-rt23.23.4.el5rt kernel on:
dell-pem610-01
dell-per905-01
tyan-gt24-08
ibm-ls42-01
ibm-hs22-01

With the exception of ibm-ls42-01, all runs completed. Apparently ibm-ls42-01 disconnected the console and ssh session about 8 hours into the test. I'm not sure if the system reset or something else caused the connection loss. The log is in the farm:
rhel5.farm.hsv.redhat.com:/es/scratch/dmarlin/RT/screenlog.4.20100603

Dropping MRG 1.3 Blocker status.

We'll tell customers that if they want to run a system with more than 4GB of RAM, they need to run a 64-bit kernel.

This looks like an issue with HIGHMEM/PAE being used on very large memory servers. We've decided to ship our 32-bit kernels without PAE support, limiting them to 4GB of RAM.

Closing
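The 4GB guidance above can be turned into a trivial check: compare MemTotal against the 32-bit non-PAE addressing limit. A sketch, assuming a Linux /proc/meminfo; the threshold is the 4GB ceiling discussed in the closing comments.

```shell
#!/bin/sh
# Sketch: flag machines whose RAM exceeds what a 32-bit non-PAE kernel
# can address (4 GB = 4194304 kB). Advisory only; changes nothing.

needs_64bit_kernel() {
    # $1 = MemTotal in kB
    [ "$1" -gt 4194304 ]
}

memkb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo 2>/dev/null)
memkb=${memkb:-0}
if needs_64bit_kernel "$memkb"; then
    echo "MemTotal ${memkb} kB exceeds 4 GB: run the x86_64 kernel"
else
    echo "MemTotal ${memkb} kB fits a 32-bit non-PAE kernel"
fi
```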