Red Hat Bugzilla – Bug 787331
System hang running Java tests on 3.0.9-rt26.45.el6rt.x86_64 kernel
Last modified: 2016-05-22 19:34:33 EDT
-- Problem Description --

IBM JTC is running tests on an AMD system. Test information: we ran with the HEAD stream for WRT, which is similar to WRT-V3. The jobs that hung were jltf copse and oak running -Xrealtime (one with the default -Xjit and the other with -Xnojit). When the system hangs, it is not possible to gather information via magic sysrq. The following suggestions have been passed on in an attempt to get more information:
- Enable RT bandwidth limiting
- Enable panic on NMI
- Rerun with the above enabled
- If the system hangs, attempt magic sysrq
- If magic sysrq is still unresponsive, send an NMI

oak job NMI output screenshot (KVM): unfortunately, this AMM is not enabled to send an NMI to the blade. Here is a screenshot of NMI output that went to the KVM console rather than SOL. Magic sysrq did not work. I am running a test on another machine where the AMM is capable of issuing an NMI to the blade.

Kernel output was redirected to the console on another test server; here is the complete stack trace in something more consumable:

rtjavm4 login: ------------[ cut here ]------------
kernel BUG at kernel/rtmutex.c:472!
invalid opcode: 0000 [#1] PREEMPT SMP
CPU 1
Modules linked in: nfs lockd auth_rpcgss nfs_acl sunrpc ipv6 bnx2 sg serio_raw pcspkr k8temp amd64_edac_mod edac_core shpchp ext4 mbcache jbd2 crc16 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ehci_hcd ohci_hcd button radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core power_supply dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 28091, comm: RT Not tainted 3.0.9-rt26.45.el6rt.x86_64 #1 IBM BladeCenter LS21 -[7971AC1]-/Server Blade
RIP: 0010:[<ffffffff8106f5ef>] [<ffffffff8106f5ef>] task_blocks_on_rt_mutex+0x76/0x211
RSP: 0000:ffff880224939c68 EFLAGS: 00010082
RAX: ffff880224b42700 RBX: ffff880129813cb8 RCX: 0000000000000001
RDX: 000000000000d9d9 RSI: ffff880129813cb8 RDI: ffff88012a2688c0
RBP: ffff880224939cc8 R08: ffff88012a2682c0 R09: 2c58000000000000
R10: ffff880224939c78 R11: 0000000000000000 R12: ffff880224b42701
R13: ffff88012a2688c0 R14: ffff88022489c850 R15: 0000000000000001
FS: 00007f313ce7a700(0000) GS:ffff88022fc00000(0063) knlGS:00000000a29e7b70
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000a3722000 CR3: 000000022a1e3000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process RT (pid: 28091, threadinfo ffff880224938000, task ffff880224b42700)
Stack:
 0000000000000000 0000000000000000 ffff880224939dd8 ffff88012a2682c0
 0000000000000292 ffff880224939c78 2c58000000000000 ffff88022489c850
 ffff880129813cb8 ffff88012a2682c0 0000000000000001 0000000000000001
Call Trace:
 [<ffffffff8106f7df>] rt_mutex_start_proxy_lock+0x55/0x8a
 [<ffffffff8106dd6d>] futex_requeue+0x498/0x607
 [<ffffffff810dd319>] ? pagefault_enable+0x1d/0x1f
 [<ffffffff8106e733>] do_futex+0x857/0x8a1
 [<ffffffff810a2560>] ? rb_reserve_next_event+0x73/0x23f
 [<ffffffff81038c0d>] ? finish_task_switch+0x52/0xab
 [<ffffffff810a62f8>] ? __trace_buffer_unlock_commit+0x3c/0x45
 [<ffffffff810a630d>] ? trace_nowake_buffer_unlock_commit+0xc/0xe
 [<ffffffff8106e9e4>] compat_sys_futex+0x128/0x140
 [<ffffffff813a03b3>] ia32_do_call+0x13/0x13
Code: 20 49 83 b8 18 06 00 00 01 75 16 48 8b 75 c0 4c 89 ef e8 a6 9d 32 00 b8 f5 ff ff ff e9 96 01 00 00 49 83 b8 18 06 00 00 01 76 04 <0f> 0b eb fe 4c 89 c7 4c 89 45 b8 e8 c5 f8 ff ff 4c 8b 45 b8 4c
RIP [<ffffffff8106f5ef>] task_blocks_on_rt_mutex+0x76/0x211
 RSP <ffff880224939c68>
---[ end trace 0000000000000002 ]---
note: RT[28091] exited with preempt_count 2
sched: RT throttling activated
------------[ cut here ]------------
kernel BUG at kernel/rtmutex.c:724!
invalid opcode: 0000 [#2] PREEMPT SMP
CPU 1
Modules linked in: nfs lockd auth_rpcgss nfs_acl sunrpc ipv6 bnx2 sg serio_raw pcspkr k8temp amd64_edac_mod edac_core shpchp ext4 mbcache jbd2 crc16 sd_mod crc_t10dif mptsas mptscsih mptbase scsi_transport_sas ehci_hcd ohci_hcd button radeon ttm drm_kms_helper drm hwmon i2c_algo_bit i2c_core power_supply dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]

Pid: 28091, comm: RT Tainted: G D 3.0.9-rt26.45.el6rt.x86_64 #1 IBM BladeCenter LS21 -[7971AC1]-/Server Blade
RIP: 0010:[<ffffffff8139892f>] [<ffffffff8139892f>] rt_spin_lock_slowlock+0x70/0x217
RSP: 0000:ffff880224939818 EFLAGS: 00010246
RAX: ffff880224b42700 RBX: ffffffff817b8960 RCX: 0000000000000001
RDX: ffffffff817b8978 RSI: ffff880224b42701 RDI: ffff880224b42701
RBP: ffff8802249398c8 R08: 00000000c283a9f6 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: ffff880224b42700
R13: ffff880224b42700 R14: 0000000000000006 R15: ffff8802249398f8
FS: 00007f313ce7a700(0000) GS:ffff88022fc00000(0063) knlGS:00000000a29e7b70
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 00000000a3722000 CR3: 000000022a1e3000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process RT (pid: 28091, threadinfo ffff880224938000, task ffff880224b42700)
Stack:
 ffffffff817b535d ffffffff8155826c ffff880224939978 ffffffff817b535e
 ffff880224939898 ffffffff8139f92e ffff8802249398d8 0000000000000400
 0000000000000000 0000000000000000 0000000000000000 00000000c283a9f6
Call Trace:
 [<ffffffff8139f92e>] ? invalidate_interrupt17+0xe/0x20
 [<ffffffff81398ff6>] rt_spin_lock+0x24/0x26
 [<ffffffff8106c2ae>] exit_pi_state_list+0x8a/0x144
 [<ffffffff8103d630>] mm_release+0x73/0xeb
 [<ffffffff81041b6e>] exit_mm+0x26/0x137
 [<ffffffff810433a7>] do_exit+0x277/0x7e3
 [<ffffffff8103febe>] ? kmsg_dump+0xe3/0xf2
 [<ffffffff8139a2e6>] oops_end+0xb6/0xbe
 [<ffffffff81004c56>] die+0x5a/0x63
 [<ffffffff81399d49>] do_trap+0x121/0x130
 [<ffffffff81002f7f>] do_invalid_op+0x96/0x9f
 [<ffffffff8106f5ef>] ? task_blocks_on_rt_mutex+0x76/0x211
 [<ffffffff810a630d>] ? trace_nowake_buffer_unlock_commit+0xc/0xe
 [<ffffffff81030222>] ? ftrace_raw_event_sched_switch+0xeb/0xfa
 [<ffffffff8100188c>] ? __switch_to+0x25c/0x26a
 [<ffffffff813993d5>] ? _raw_spin_unlock_irq+0x24/0x2f
 [<ffffffff81038c0d>] ? finish_task_switch+0x52/0xab
 [<ffffffff8139fc95>] invalid_op+0x15/0x20
 [<ffffffff8106f5ef>] ? task_blocks_on_rt_mutex+0x76/0x211
 [<ffffffff8106f5af>] ? task_blocks_on_rt_mutex+0x36/0x211
 [<ffffffff8106f7df>] rt_mutex_start_proxy_lock+0x55/0x8a
 [<ffffffff8106dd6d>] futex_requeue+0x498/0x607
 [<ffffffff810dd319>] ? pagefault_enable+0x1d/0x1f
 [<ffffffff8106e733>] do_futex+0x857/0x8a1
 [<ffffffff810a2560>] ? rb_reserve_next_event+0x73/0x23f
 [<ffffffff81038c0d>] ? finish_task_switch+0x52/0xab
 [<ffffffff810a62f8>] ? __trace_buffer_unlock_commit+0x3c/0x45
 [<ffffffff810a630d>] ? trace_nowake_buffer_unlock_commit+0xc/0xe
 [<ffffffff8106e9e4>] compat_sys_futex+0x128/0x140
 [<ffffffff813a03b3>] ia32_do_call+0x13/0x13
Code: 48 89 43 10 31 d2 b9 01 00 00 00 4c 89 e6 48 89 df e8 df 63 cd ff 85 c0 0f 85 9a 01 00 00 48 8b 43 18 48 83 e0 fe 49 39 c4 75 04 <0f> 0b eb fe 4d 8d ac 24 00 06 00 00 4c 8d bd 60 ff ff ff 65 4c
RIP [<ffffffff8139892f>] rt_spin_lock_slowlock+0x70/0x217
 RSP <ffff880224939818>

NMI SOL output - oak test job: complete output redirected to the console after initiating an NMI signal to the blade.

Hmm, it might be worth testing the patch that we just got from Steven Rostedt for the futex issues.

This is the crash we were hitting during our testing on the AMD box. It looks like the same issue. This BUG is not specific to AMD/Intel.

Tests were successful on 2 separate machine runs this morning after applying the kernel patch. We are currently testing the MRG 2.1 GA kernel with Steven's patch for the pi-requeue bug fix. It still looks like Steven's patch fixes the original problem that was causing the initial hang that the LTC had also run into. The latest issue seems to be a different problem at first glance.

Nivedita, it appears so, but I am unable to quantify it expertly. 300 threads is mentioned in the last log prior to the crash:

j9build@hammervm29 /team/ibaron/HRT $ tail -20 outFix12
> 15:29:34 20120203 15:29:33 Variable Processor INFO: Read variable from input file: threads = 300
j> 15:29:34 20120203 15:29:33 Variable Processor INFO: Overriding variable duration from system property with value 300
j> 15:29:34 20120203 15:29:33 Variable Processor INFO: Replacing variable $duration in Data node jltf.workstack.jobset.duration, value = 300
j> 15:29:34 20120203 15:29:33 Variable Processor INFO: Replacing variable $threads in Data node jltf.workstack.jobset.simpledriver.threadlevel.constantinteger.amp, value = 300
j> 15:29:34 20120203 15:29:33 Launcher INFO: Running tests
j> 15:29:34 20120203 15:29:33 Plugin Loader INFO: Loading Plugins
j> 15:29:34 20120203 15:29:33 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.workunitadapters.JUnitWorkUnitPlugin
j> 15:29:34 20120203 15:29:33 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.threads.RunOnceThreadPlugin
j> 15:29:34 20120203 15:29:34 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.threads.RTThreadPlugin
j> 15:29:34 20120203 15:29:34 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.SimpleDriverPlugin
j> 15:29:34 20120203 15:29:34 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.logging.LoggingPlugin
j> 15:29:34 20120203 15:29:34 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.functions.ConstantIntegerFunctionPlugin
j> 15:29:34 20120203 15:29:34 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.threads.InternalThreadManagerPlugin
j> 15:29:34 20120203 15:29:34 Plugin Loader INFO: Loading com.ibm.jtc.test.jltf.threads.PooledRTThreadPlugin
j> 15:29:34 20120203 15:29:34 Pooled Work Thread Factory INFO: Thread Pool Exhaustion Behaviour: Grow Pool
j> 15:29:34 20120203 15:29:34 ConfigurationReporter INFO: Change detected
j> 15:29:34 20120203 15:29:34 Execution Timer INFO: Execution Timer started
j> 15:29:34 20120203 15:29:34 Thread Control Engine INFO: Starting JobSet Generated
j> 15:30:05 20120203 15:30:04 Runtime State Reporter IMPORTANT: 15771 tests complete, 152 currently running
j> 15:30:35 20120203 15:30:34 Runtime State Reporter IMPORTANT: 31053 tests complete, 157 currently running

Karl, we have a few questions and recommendations for next steps.
o Can you confirm that the system had locked up completely? No pings, no ssh, no sysrq trigger working, etc.?
o Can you get a crash dump on the standard RHEL 6.2 kernel? If not, let's make it so that we can.
o Can you run the same Java test that triggers the bug on the standard RHEL 6.2 kernel? You only need to reboot into it and restart the tests. I know a few tests only run on Real-Time Linux, but we are hoping the triggering test case can be run on the standard kernel. Hopefully that will give us the data point that this is not an RT-specific issue, or at least a dump that allows us to do some more investigation.
Also, could the JTC confirm that disk space was not full? With the jltf benchmark, a frequent occurrence apparently was that the system would fill up the file system (not that a hard hang is acceptable), but that might present some of the same symptoms.
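For context on how a Java workload reaches the futex_requeue()/rt_mutex_start_proxy_lock() path shown in the backtrace above: condition variables built on priority-inheriting mutexes (as glibc's NPTL does for PTHREAD_PRIO_INHERIT) use the requeue-PI futex operations. The sketch below only illustrates that interface; the futex words and function names are hypothetical, it is not the failing test case itself, and error handling is omitted.

/* Illustrative userspace side of the requeue-PI futex path (hypothetical
 * variables; not the JTC test case). */
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int cond_word;   /* hypothetical condvar wait-queue futex word */
static int pi_mutex;    /* hypothetical PI futex word backing the mutex */

/* Waiter: block on cond_word if it still holds 'expected', asking the kernel
 * to requeue us onto the PI futex when the condvar is signalled. */
static long wait_requeue_pi(unsigned int expected)
{
        return syscall(SYS_futex, &cond_word, FUTEX_WAIT_REQUEUE_PI,
                       expected, NULL, &pi_mutex, 0);
}

/* Waker: wake one waiter and requeue up to nr_requeue others onto the PI
 * futex.  In the kernel this is the FUTEX_CMP_REQUEUE_PI path that runs
 * futex_requeue() -> rt_mutex_start_proxy_lock(), i.e. the code that hit the
 * BUG at kernel/rtmutex.c:472 above. */
static long cmp_requeue_pi(unsigned int expected, long nr_requeue)
{
        return syscall(SYS_futex, &cond_word, FUTEX_CMP_REQUEUE_PI,
                       1 /* wake at most one waiter */, (void *)nr_requeue,
                       &pi_mutex, expected);
}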
Created attachment 559387 [details] NMI SOL output - oak test job
Created attachment 559388 [details] NMI SOL output - oak test job Extended Run Feb 3rd RTJAVM4 (LS21)
Was the above backtrace generated from the GA kernel or from one with the futex requeue patch Steven Rostedt generated?
This definitely looks like the same bug as bug 784733. With the patches I posted to that BZ, this bug should not occur. I'm pretty positive about that (based on the backtrace and the bug location).
------- Comment From niv@us.ibm.com 2012-02-06 12:04 EDT------- Clark, The first problem the JTC hit was the same as ours (that was on the original MRG 2.1 .45 GA kernel). After we installed the patched kernel with Steven's fix, the testing did not run into that problem again (so, more evidence that problem has been solved). It completed one successful run of 10 iterations (a fairly long test), which it wasn't able to do before. However, when a longer test was scheduled, it seems to have hit another problem. We're currently speculating that it's in the dcache lookup path (see the trace from the extended testing). Paul McKenney looked at it and has generated a patch (basically, detect whether we're spinning on this), and we are going to give that a shot. So there are two separate problems here, and perhaps we should open a separate bug and declare this one closed, as the first fix did indeed fix the first problem. The patch from Paul will be attached here soon, with more updates to follow.
------- Comment From kravetz@us.ibm.com 2012-02-06 16:47 EDT------- Going forward with debug, the following will be attempted:
- JTC will start the process of allowing LTC access to test systems through the BSOs
- When hitting this issue, JTC will attempt the magic sysrq "t" command to get a stack trace of all processes
- JTC will test with Paul McKenney's experimental patch
- Mike K will build new RPMs with Paul's patch
Created attachment 559762 [details] Paul's Speculative Patch ------- Comment (attachment only) From kravetz@us.ibm.com 2012-02-06 16:50 EDT-------
------- Comment From kravetz@us.ibm.com 2012-02-06 17:04 EDT------- RPMs that include Paul's patch (as well as the previous pi-requeue patch) are located at: http://kernel.beaverton.ibm.com/realtime/test_rpms2/
RPMs to be installed are:
- kernel-rt-3.0.9-rt26.45.el6rt.prtest2.x86_64.rpm
- kernel-rt-firmware-3.0.9-rt26.45.el6rt.prtest2.noarch.rpm
If Paul's patch hits the speculated condition, the following message will be put in the system log: "d_lookup: write_seqlock forced !!!"
Created attachment 559992 [details] messages rtjavm4 Feb 6 2012 ------- Comment on attachment From Karl.Koelbel@ca.ibm.com 2012-02-07 10:21 EDT------- messages file
------- Comment From niv@us.ibm.com 2012-02-07 11:37 EDT------- Update from Iris on JTC testing. Testing updates:
1. JLTF mauve finished 40 runs successfully in HRT mode on 2 AMD machines with the first patch. This means that the problem is not with multiple threads per se, but with the RTSJ test.
2. I ran my local oak version on rtjivm5 but didn't see the hang after 25 runs - it might be that my test is not up to date.
3. Karl updated all 3 machines with the new patch and I kicked off new JLTF oak batches on all 3 machines.
------- Comment From kravetz@us.ibm.com 2012-02-08 12:25 EDT------- Hello Iris, It appears the 3 systems above were running the 3.0.9-rt26.45.el6rt.prtest2.x86_64 kernel. This kernel contains both the pi-requeue patch and Paul McKenney's speculative dcache patch. We would like to know if any of the systems experienced the condition addressed in Paul's patch. If they did experience this condition, there would be a message containing the text below in the system log:
"d_lookup: write_seqlock forced !!!"
Can you please look for the above text in the system log of the 3 systems? This will tell us whether or not the code introduced by Paul's patch was ever executed. Neither of the above patches is in an official Red Hat erratum kernel. Progress is being made on getting the pi-requeue patch into mainline and then into an erratum kernel. At this time, we do not know if Paul's patch actually addresses a real issue. Therefore, your feedback about messages in the system logs is necessary for continued analysis and evaluation of Paul's patch.
------- Comment From kravetz@us.ibm.com 2012-02-08 16:13 EDT------- Hello Iris, I assume you want an 'official' early version of the erratum kernel from Red Hat, instead of the RPMs I have been building for you? If so, we will need someone from Red Hat to comment on possible availability.
------- Comment From niv@us.ibm.com 2012-02-08 16:58 EDT------- I talked to Clark Williams just now. He has updated the repository with the .50 RPMs. This version has the fix from Steven Rostedt for the pi-requeue issue. It does not have Paul McKenney's diagnostic/speculative patch. The RPMs are here: http://people.redhat.com/williams/.ibm/ Please download and use them for MRG 2.1 testing. I'll also back them up on kbic.
------- Comment From niv@us.ibm.com 2012-02-15 18:32 EDT------- Note: multiple bugs have been tracked in this report.
1. The first issue that was run into was a DUP of LTC #78072 (RH784733).
-- This was a hang in the pi-requeue code that was fixed by Steven Rostedt's patch.
-- This patch is now upstream (mainline RT) as well as in the Red Hat MRG 2.1 errata kernel to be released later this month.
-- This bug is now being used to track the other issues being hit.
2. Issues from the subsequent testing on the updated Red Hat kernel RPM (MRG 2.1, .50 kernel):
- The first issue seems to be a hang in the dcache code path.
- Paul McKenney has provided a patch to resolve it (this is a speculative patch).
- Testing is running without bandwidth limiting, and the hang is not seen with bandwidth limiting enabled.
CURRENT SUMMARY
=======================
- Testing is hitting a problem with no bandwidth limiting (on the .50 kernel, without Paul's patch).
- Testing with Paul's patch = ?
INFO NEEDED
================
o Iris, when you're updating this bug, please add the following information when appropriate:
- which kernel you were doing the testing on (output of "uname -a")
o Iris, can you update this bug with whether you have run with Paul's patch or not? I believe Mike Kravetz had built an RPM with that patch for you. Apologies if I missed seeing that in this comment thread.
------- Comment From kravetz@us.ibm.com 2012-02-16 19:48 EDT------- Ok, so the summary above states that systems still hang with Paul's patch and RT bandwidth limiting disabled. Therefore, the next step appears to be debugging the system while it is in the hang state. One thing that 'may' help is code that will list a stack traceback of all tasks in the system when the hang occurs. Normally, this would be possible via magic sysrq T. But magic sysrq is blocked even in this hang state. My idea would be to generate this list of all stack tracebacks when an NMI is delivered to the system. This will require a new debug kernel. I can start work on this type of debug code. Do people think this would be useful?
Sounds useful to me. Basically do a sysrq-t then panic?
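A rough sketch of the "sysrq-t then panic" idea, assuming the 3.0-era x86 die-notifier path for NMIs (this is not the actual debug kernel Mike built, and the module and its names are hypothetical): on an incoming NMI it dumps every task's stack the way sysrq-t does, then panics so kdump can capture a vmcore. Printing that much from NMI context is itself best-effort, which is part of why a vmcore is the more reliable artifact.

#include <linux/kdebug.h>
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/notifier.h>
#include <linux/sched.h>

static int nmi_dump_all_tasks(struct notifier_block *nb,
                              unsigned long cmd, void *data)
{
        if (cmd != DIE_NMI)
                return NOTIFY_DONE;

        show_state();           /* per-task backtraces, same as sysrq-t */
        panic("NMI: dumped all task stacks");
        return NOTIFY_STOP;     /* not reached */
}

static struct notifier_block nmi_dump_nb = {
        .notifier_call = nmi_dump_all_tasks,
        .priority      = INT_MAX,       /* run ahead of other NMI users */
};

static int __init nmi_dump_init(void)
{
        return register_die_notifier(&nmi_dump_nb);
}
module_init(nmi_dump_init);
MODULE_LICENSE("GPL");

Combined with the kernel.unknown_nmi_panic sysctl and kdump (see the next comment), this would produce both a console traceback and a vmcore that can be analyzed offline.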
------- Comment From kravetz@us.ibm.com 2012-02-17 16:43 EDT------- There is a workaround for the kdump issue. So, the next step is to get a dump when the system hangs. Follow the instructions at the following location to set up kdump: http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_MRG/2/html/Realtime_Tuning_Guide/sect-Realtime_Tuning_Guide-Realtime_Specific_Tuning-Using_kdump_and_kexec_with_the_RT_kernel.html
However, in the section "Enabling kdump with grub.conf", perform the following steps BEFORE rebooting the system:
- Obtain the value of the KDUMP_KERNELVER variable in the file /etc/sysconfig/kdump
- Find the initramfs of the kdump kernel (KDUMP_KERNELVER) in the boot filesystem, for example initramfs-2.6.32-220.4.1.el6.x86_64.img
- Make a copy of the initramfs, but add the string "kdump" before ".img". For example: cp initramfs-2.6.32-220.4.1.el6.x86_64.img initramfs-2.6.32-220.4.1.el6.x86_64kdump.img
- Make sure the line "kernel.unknown_nmi_panic = 1" exists in the file /etc/sysctl.conf
Then reboot and continue with the instructions in the document. Be sure to test with "echo "c" > /proc/sysrq-trigger". Now, rerun the tests and, when the system hangs, send an NMI. A vmcore file should be produced in /var/crash. We can use this to obtain more information about the state of the system at the time of the hang.
------- Comment From niv@us.ibm.com 2012-02-17 18:02 EDT------- There were changes to /etc/rtgroups between MRG 2.1 and previous MRG releases as well. Those could impact behavior, too. Good data points:
o Results with throttling on previous MRG versions --> these would indicate whether the change in the priority settings might be responsible.
o Results with the kernel resources bumped up above the Java priorities (I'll attach an updated /etc/rtgroups file). This latter test would be the simplest, easiest next step to take.
o It would be good to collect "ps -eLo pid,cmd,rtprio,policy" output during the test (before it hangs, naturally).
------- Comment From kravetz@us.ibm.com 2012-02-23 18:31 EDT------- I have examined a couple of vmcore files produced by forcing a panic via NMI when the system hangs. Here are the stacks running on the four CPUs:

CPU 0
---------
#7 [ffff880222bd5da8] d_lookup at ffffffff811165d7
#8 [ffff880222bd5de0] d_hash_and_lookup at ffffffff811166a9
#9 [ffff880222bd5e00] proc_flush_task at ffffffff811538b1
#10 [ffff880222bd5e90] release_task at ffffffff810422d4
#11 [ffff880222bd5ef0] do_exit at ffffffff81043872
--------
#7 [ffff88021118dda8] d_lookup at ffffffff811165df
#8 [ffff88021118dde0] d_hash_and_lookup at ffffffff811166a9
#9 [ffff88021118de00] proc_flush_task at ffffffff811538b1
#10 [ffff88021118de90] release_task at ffffffff810422d4
#11 [ffff88021118def0] do_exit at ffffffff81043872

CPU 2
--------
#8 [ffff88012a46bd28] __handle_sysrq at ffffffff8123a5c3
#9 [ffff88012a46bd58] handle_sysrq at ffffffff8123a844
#10 [ffff88012a46bd68] serial8250_handle_port at ffffffff8124b9ce
#11 [ffff88012a46bdc8] serial8250_interrupt at ffffffff8124bb94
#12 [ffff88012a46be08] irq_forced_thread_fn at ffffffff81093083
#13 [ffff88012a46be38] irq_thread at ffffffff81092ed8
#14 [ffff88012a46beb8] kthread at ffffffff8105c5f9
#15 [ffff88012a46bf48] kernel_thread_helper at ffffffff8139fe94

CPU 3
--------
#7 [ffff880222273cc8] d_lookup at ffffffff811165df
#8 [ffff880222273d00] __lookup_hash at ffffffff8110c46a
#9 [ffff880222273d40] do_last at ffffffff8110e236
#10 [ffff880222273da0] path_openat at ffffffff8110eb60
#11 [ffff880222273e30] do_filp_open at ffffffff8110ee9d
#12 [ffff880222273f10] do_sys_open at ffffffff81101628
#13 [ffff880222273f70] compat_sys_open at ffffffff8113d83c
#14 [ffff880222273f80] cstar_dispatch at ffffffff813a0229

The CPUs spinning in d_lookup() are actually spinning in the routine __read_seqcount_begin(), which looks like:

static inline unsigned __read_seqcount_begin(const seqcount_t *s)
{
        unsigned ret;

repeat:
        ret = s->sequence;
        if (unlikely(ret & 1)) {
                cpu_relax();
                goto repeat;
        }
        return ret;
}

The seqcount_t used by d_lookup() is rename_lock. We can examine the value of rename_lock:

crash> *seqcount rename_lock
struct seqcount {
  sequence = 96951
}

Note that the sequence count is odd, which means that someone has taken the lock for writing. Unfortunately, the task holding the lock for writing was preempted, and it is unlikely to run again because the CPUs are being consumed by RT tasks spinning in __read_seqcount_begin(). Paul McKenney previously created a patch because he suspected this was the issue. Unfortunately, Paul's patch was not quite right: it looked for tasks spinning in d_lookup() rather than in the locking primitive. Paul will provide a new version of his patch.
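To make the livelock concrete, the fragment below mirrors the __read_seqcount_begin() loop quoted above and adds the kind of bounded-spin detection being discussed. The helper name, the spin threshold, and the log message are illustrative assumptions, not the contents of Paul's attached patch: once the writer that made rename_lock's sequence odd has been preempted by SCHED_FIFO readers on every CPU, the loop never exits, so some detection or escape hatch is needed.

#include <linux/kernel.h>
#include <linux/seqlock.h>
#include <asm/processor.h>      /* cpu_relax() */

/* Illustrative only: the reader spin with a hypothetical spin counter.
 * An odd sequence means a writer is mid-update; a preempted writer
 * therefore leaves every reader stuck in this loop. */
static inline unsigned read_seqcount_begin_bounded(const seqcount_t *s)
{
        unsigned long spins = 0;
        unsigned ret;

        for (;;) {
                ret = s->sequence;
                if (!(ret & 1))
                        return ret;             /* no writer: proceed */
                cpu_relax();
                /* Hypothetical threshold: after ~10 million spins, assume the
                 * writer has been starved and say so, so the condition shows
                 * up in the logs instead of as a silent hang. */
                if (++spins == 10 * 1000 * 1000)
                        printk(KERN_WARNING
                               "seqcount reader starved, seq=%u\n", ret);
        }
}

In the vmcore above, the sequence value 96951 is exactly this odd case, which is why the CPUs stuck in d_lookup() never get out.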
Created attachment 565412 [details] Paul's Speculative Patch V2 ------- Comment on attachment From kravetz@us.ibm.com 2012-02-23 18:34 EDT------- An updated version of Paul's patch
------- Comment From kravetz@us.ibm.com 2012-02-27 16:34 EDT------- Stack traces from one of the recent vmcores:

CPU 0
-----
#5 [ffff88041b08fa48] rt_spin_unlock at ffffffff8139c316
#6 [ffff88041b08fa50] dentry_kill at ffffffff811150cc
#7 [ffff88041b08fa80] shrink_dentry_list at ffffffff811152c4
#8 [ffff88041b08fad0] __shrink_dcache_sb at ffffffff811154ad
#9 [ffff88041b08fb50] shrink_dcache_parent at ffffffff81116365
#10 [ffff88041b08fbd0] proc_flush_task at ffffffff81153b5b
#11 [ffff88041b08fc60] release_task at ffffffff81042418
#12 [ffff88041b08fcc0] do_exit at ffffffff810439d3
#13 [ffff88041b08fd40] do_group_exit at ffffffff81043afb
#14 [ffff88041b08fd70] get_signal_to_deliver at ffffffff810526ee
#15 [ffff88041b08fe20] do_signal at ffffffff81002095
#16 [ffff88041b08ff20] do_notify_resume at ffffffff810026da
#17 [ffff88041b08ff50] int_signal at ffffffff813a2448
-----
#7 [ffff88041c5e3a28] migrate_enable at ffffffff8103aefe
#8 [ffff88041c5e3a50] dentry_kill at ffffffff81115083
#9 [ffff88041c5e3a80] shrink_dentry_list at ffffffff811152c4
#10 [ffff88041c5e3ad0] __shrink_dcache_sb at ffffffff811154ad
#11 [ffff88041c5e3b50] shrink_dcache_parent at ffffffff81116365
#12 [ffff88041c5e3bd0] proc_flush_task at ffffffff81153b5b
#13 [ffff88041c5e3c60] release_task at ffffffff81042418
#14 [ffff88041c5e3cc0] do_exit at ffffffff810439d3
#15 [ffff88041c5e3d40] do_group_exit at ffffffff81043afb
#16 [ffff88041c5e3d70] get_signal_to_deliver at ffffffff810526ee
#17 [ffff88041c5e3e20] do_signal at ffffffff81002095
#18 [ffff88041c5e3f20] do_notify_resume at ffffffff810026da
#19 [ffff88041c5e3f50] int_signal at ffffffff813a2448

CPU 2
-----
#7 [ffff88041c5e7a20] unpin_current_cpu at ffffffff810419ae
#8 [ffff88041c5e7a20] migrate_enable at ffffffff8103affa
#9 [ffff88041c5e7a50] dentry_kill at ffffffff81115091
#10 [ffff88041c5e7a80] shrink_dentry_list at ffffffff811152c4
#11 [ffff88041c5e7ad0] __shrink_dcache_sb at ffffffff811154ad
#12 [ffff88041c5e7b50] shrink_dcache_parent at ffffffff81116365
#13 [ffff88041c5e7bd0] proc_flush_task at ffffffff81153b5b
#14 [ffff88041c5e7c60] release_task at ffffffff81042418
#15 [ffff88041c5e7cc0] do_exit at ffffffff810439d3
#16 [ffff88041c5e7d40] do_group_exit at ffffffff81043afb
#17 [ffff88041c5e7d70] get_signal_to_deliver at ffffffff810526ee
#18 [ffff88041c5e7e20] do_signal at ffffffff81002095
#19 [ffff88041c5e7f20] do_notify_resume at ffffffff810026da
#20 [ffff88041c5e7f50] int_signal at ffffffff813a2448

CPU 3
-----
#7 [ffff880233a4fa00] _raw_spin_lock at ffffffff8139c5d9
#8 [ffff880233a4fa00] rt_mutex_slowtrylock at ffffffff8106f232
#9 [ffff880233a4fa30] rt_mutex_trylock at ffffffff8139b9b9
#10 [ffff880233a4fa40] rt_spin_trylock at ffffffff8139c2ce
#11 [ffff880233a4fa50] dentry_kill at ffffffff811150b5
#12 [ffff880233a4fa80] shrink_dentry_list at ffffffff811152c4
#13 [ffff880233a4fad0] __shrink_dcache_sb at ffffffff811154ad
#14 [ffff880233a4fb50] shrink_dcache_parent at ffffffff81116365
#15 [ffff880233a4fbd0] proc_flush_task at ffffffff81153b5b
#16 [ffff880233a4fc60] release_task at ffffffff81042418
#17 [ffff880233a4fcc0] do_exit at ffffffff810439d3
#18 [ffff880233a4fd40] do_group_exit at ffffffff81043afb
#19 [ffff880233a4fd70] get_signal_to_deliver at ffffffff810526ee
#20 [ffff880233a4fe20] do_signal at ffffffff81002095
#21 [ffff880233a4ff20] do_notify_resume at ffffffff810026da
#22 [ffff880233a4ff50] int_signal at ffffffff813a2448

CPU 4
-----
#7 [ffff88041b08ba38] rt_mutex_trylock at ffffffff8139b9b9
#8 [ffff88041b08ba40] rt_spin_trylock at ffffffff8139c2ce
#9 [ffff88041b08ba50] dentry_kill at ffffffff811150b5
#10 [ffff88041b08ba80] shrink_dentry_list at ffffffff811152c4
#11 [ffff88041b08bad0] __shrink_dcache_sb at ffffffff811154ad
#12 [ffff88041b08bb50] shrink_dcache_parent at ffffffff81116365
#13 [ffff88041b08bbd0] proc_flush_task at ffffffff81153b5b
#14 [ffff88041b08bc60] release_task at ffffffff81042418
#15 [ffff88041b08bcc0] do_exit at ffffffff810439d3
#16 [ffff88041b08bd40] do_group_exit at ffffffff81043afb
#17 [ffff88041b08bd70] get_signal_to_deliver at ffffffff810526ee
#18 [ffff88041b08be20] do_signal at ffffffff81002095
#19 [ffff88041b08bf20] do_notify_resume at ffffffff810026da
#20 [ffff88041b08bf50] int_signal at ffffffff813a2448

CPU 5
-----
#7 [ffff8804178f3a58] dentry_kill at ffffffff8111507a
#8 [ffff8804178f3a80] shrink_dentry_list at ffffffff811152c4
#9 [ffff8804178f3ad0] __shrink_dcache_sb at ffffffff811154ad
#10 [ffff8804178f3b50] shrink_dcache_parent at ffffffff81116365
#11 [ffff8804178f3bd0] proc_flush_task at ffffffff81153b5b
#12 [ffff8804178f3c60] release_task at ffffffff81042418
#13 [ffff8804178f3cc0] do_exit at ffffffff810439d3
#14 [ffff8804178f3d40] do_group_exit at ffffffff81043afb
#15 [ffff8804178f3d70] get_signal_to_deliver at ffffffff810526ee
#16 [ffff8804178f3e20] do_signal at ffffffff81002095
#17 [ffff8804178f3f20] do_notify_resume at ffffffff810026da
#18 [ffff8804178f3f50] int_signal at ffffffff813a2448

CPU 6
-----
#7 [ffff88041ca7fa08] rt_mutex_slowtrylock at ffffffff8106f278
#8 [ffff88041ca7fa30] rt_mutex_trylock at ffffffff8139b9b9
#9 [ffff88041ca7fa40] rt_spin_trylock at ffffffff8139c2ce
#10 [ffff88041ca7fa50] dentry_kill at ffffffff811150b5
#11 [ffff88041ca7fa80] shrink_dentry_list at ffffffff811152c4
#12 [ffff88041ca7fad0] __shrink_dcache_sb at ffffffff811154ad
#13 [ffff88041ca7fb50] shrink_dcache_parent at ffffffff81116365
#14 [ffff88041ca7fbd0] proc_flush_task at ffffffff81153b5b
#15 [ffff88041ca7fc60] release_task at ffffffff81042418
#16 [ffff88041ca7fcc0] do_exit at ffffffff810439d3
#17 [ffff88041ca7fd40] do_group_exit at ffffffff81043afb
#18 [ffff88041ca7fd70] get_signal_to_deliver at ffffffff810526ee
#19 [ffff88041ca7fe20] do_signal at ffffffff81002095
#20 [ffff88041ca7ff20] do_notify_resume at ffffffff810026da
#21 [ffff88041ca7ff50] int_signal at ffffffff813a2448

CPU 7
-----
#7 [ffff8802f6a0fa28] migrate_disable at ffffffff8103b0d8
#8 [ffff8802f6a0fa50] dentry_kill at ffffffff81115071
#9 [ffff8802f6a0fa80] shrink_dentry_list at ffffffff811152c4
#10 [ffff8802f6a0fad0] __shrink_dcache_sb at ffffffff811154ad
#11 [ffff8802f6a0fb50] shrink_dcache_parent at ffffffff81116365
#12 [ffff8802f6a0fbd0] proc_flush_task at ffffffff81153b5b
#13 [ffff8802f6a0fc60] release_task at ffffffff81042418
#14 [ffff8802f6a0fcc0] do_exit at ffffffff810439d3
#15 [ffff8802f6a0fd40] do_group_exit at ffffffff81043afb
#16 [ffff8802f6a0fd70] get_signal_to_deliver at ffffffff810526ee
#17 [ffff8802f6a0fe20] do_signal at ffffffff81002095
#18 [ffff8802f6a0ff20] do_notify_resume at ffffffff810026da
#19 [ffff8802f6a0ff50] int_signal at ffffffff813a2448
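The shape visible on nearly every CPU above is a trylock-and-retry loop in the dcache shrink path: each exiting SCHED_FIFO thread holds one lock, try-locks another, and on failure drops everything and retries. The sketch below is a simplified illustration of why that pattern livelocks (hypothetical types and names, not the real fs/dcache.c code): with FIFO threads busy retrying on every CPU, the lower-priority owner of the contended lock is never scheduled, so no trylock ever succeeds.

#include <linux/spinlock.h>
#include <asm/processor.h>      /* cpu_relax() */

/* Hypothetical stand-in for the pair of locks the shrink path must hold;
 * assume the locks are initialized elsewhere. */
struct dentry_like {
        spinlock_t d_lock;
        spinlock_t parent_lock;
};

/* Simplified retry loop in the style of the traces above (illustrative only).
 * On PREEMPT_RT these "spinlocks" are rt-mutexes, but a failed trylock still
 * just retries, and if every CPU is occupied by FIFO tasks retrying here, the
 * owner of parent_lock never runs and nobody makes progress. */
static void shrink_one(struct dentry_like *d)
{
        for (;;) {
                spin_lock(&d->d_lock);
                if (spin_trylock(&d->parent_lock))
                        break;                  /* got both locks */
                spin_unlock(&d->d_lock);
                cpu_relax();                    /* busy retry, never sleeps */
        }
        /* ... teardown would happen here ... */
        spin_unlock(&d->parent_lock);
        spin_unlock(&d->d_lock);
}

This also matches the observation in the summary above that the hang is not seen with RT bandwidth limiting enabled: throttling forces the RT tasks to yield some CPU time, which lets the lock owner finally run.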
------- Comment From kravetz@us.ibm.com 2012-02-27 19:57 EDT------- Looks like this may be what we are hitting: https://lkml.org/lkml/2012/1/10/125
Created attachment 568052 [details] sync cpu down with notifiers (based against 3.2.9-rt16)
Created attachment 568305 [details] sync cpu down with notifiers (based against 3.0.20-rt36)
Created attachment 568306 [details] Revert workqueue patch 1 (3.0.20-rt36)
Created attachment 568307 [details] Revert workqueue patch 2 (3.0.20-rt36)
The last three patches I added are based on top of 3.0.20-rt36. With those, plus the other patches in bugzillas 799385, 799386, 799389, 799391, and 799399, the problem should be fixed. Note: the cpu.c changes made here are not liked upstream. The hotplug code really needs to be redesigned, both in -rt and in mainline. But the patch does fix the issue for now while we come up with a better solution.
------- Comment From nivedita@us.ibm.com 2012-04-04 14:50 EDT------- Given that the initial issue plus a later OS issue that was reported in this bug have been identified and fixed, I recommend closing this bug. I recommend opening a new bug with information on the new/current issue.
(In reply to comment #31)
> ------- Comment From nivedita@us.ibm.com 2012-04-04 14:50 EDT-------
> Given that the initial issue plus a later OS issue that was reported in this
> bug have been identified and fixed, I recommend closing this bug.
>
> I recommend opening a new bug with information on the new/current issue.

Verified by customer.
Technical note added. If any revisions are required, please edit the "Technical Notes" field accordingly. All revisions will be proofread by the Engineering Content Services team. New Contents:
* Cause: the realtime workqueue patch introduced race conditions in CPU hotplug
* Consequence: system hang in the CPU hotplug code
* Fix: revert the (unneeded) pair of workqueue patches
* Result: no system hang
Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report. http://rhn.redhat.com/errata/RHBA-2012-0496.html
*** Bug 795477 has been marked as a duplicate of this bug. ***