Bug 867433 - system lock up or NMI during console stress
Summary: system lock up or NMI during console stress
Keywords:
Status: CLOSED CANTFIX
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 17
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: urgent
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2012-10-17 13:54 UTC by Vasu Dev
Modified: 2012-11-30 19:57 UTC
CC List: 8 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-11-30 19:57:52 UTC
Type: Bug
Embargoed:


Attachments (Terms of Use)
test module and RHEL log before lock up (192.25 KB, application/x-compressed)
2012-10-17 13:54 UTC, Vasu Dev
patch to reduce the async thread count to be sane (1.52 KB, patch)
2012-11-14 20:51 UTC, Neil Horman

Description Vasu Dev 2012-10-17 13:54:05 UTC
Created attachment 628827 [details]
test module and RHEL log before lock up

Description of problem:

The system locks up while the console is flooded by the attached test module. The test module simply creates a kernel thread per CPU and binds each thread to its CPU. Each thread writes single-line debug messages a tunable number of times (1000000 by default) and then sleeps for 100 seconds. The print count and sleep period can be tuned via the module parameters test_pr_count and test_pr_sleep, though the defaults are enough to trigger the system lockup or NMI.
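For reference, a minimal sketch of what a test module along these lines might look like. The actual attachment is not reproduced here; only the test_pr_count/test_pr_sleep parameter names, their defaults, the print strings, and the nice -20 renice come from this report, everything else is illustrative.

#include <linux/module.h>
#include <linux/kthread.h>
#include <linux/delay.h>
#include <linux/sched.h>
#include <linux/cpumask.h>
#include <linux/err.h>

static int test_pr_count = 1000000;	/* prints per cycle (default from this report) */
module_param(test_pr_count, int, 0644);
static int test_pr_sleep = 100;		/* sleep between cycles, in seconds */
module_param(test_pr_sleep, int, 0644);

static struct task_struct *flood_task[NR_CPUS];

static int flood_fn(void *data)
{
	int cpu = (long)data;

	set_user_nice(current, -20);	/* the report indicates the original module renices */
	while (!kthread_should_stop()) {
		int n = test_pr_count;

		/* flood the console test_pr_count times, then sleep */
		while (n--)
			pr_info("VDBG Console print flood from CPU:%d\n", cpu);
		ssleep(test_pr_sleep);
		pr_info("VDBG out of sleep CPU:%d\n", cpu);
	}
	return 0;
}

static int __init flood_init(void)
{
	int cpu;

	for_each_online_cpu(cpu) {
		flood_task[cpu] = kthread_create(flood_fn, (void *)(long)cpu,
						 "prflood/%d", cpu);
		if (IS_ERR(flood_task[cpu]))
			return PTR_ERR(flood_task[cpu]);	/* error unwind omitted */
		kthread_bind(flood_task[cpu], cpu);		/* pin the thread to its CPU */
		wake_up_process(flood_task[cpu]);
	}
	return 0;
}

static void __exit flood_exit(void)
{
	int cpu;

	for_each_online_cpu(cpu)
		if (flood_task[cpu] && !IS_ERR(flood_task[cpu]))
			kthread_stop(flood_task[cpu]);	/* may block until a cycle finishes */
}

module_init(flood_init);
module_exit(flood_exit);
MODULE_LICENSE("GPL");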


Version-Release number of selected component (if applicable):

   Fedora 17 3.3.4-5.fc17.x86_64 and RHEL 6.3 2.6.32-279.9.1.el6.x86_64

How reproducible:

   Each run takes less than an hour on these platforms:

    1. Dell PowerEdge R710, Dual Hexa-Core CPU w/ HT (12 physical cores, 24 threads) (Xeon X5650), 12GB, 4 x 160GB Raid 5 Virtual Disk, BIOS Version 6.1.0, System ID 16664

   2. Dell PowerEdge R710, Dual Hexa-Core CPU w/ HT (12 physical cores, 24 threads) (Xeon X5650), 12GB, 4 x 160GB Raid 5 Virtual Disk, BIOS Version 6.1.0, System ID 16664

Steps to Reproduce:
1. build attached test module
2. load the test module; the issue occurs in less than an hour
  
Actual results:

Either the system locks up or an NMI fires.

Expected results:

Console flooding shouldn't cause a system lockup or NMI.

Additional info:

The issue was originally reported with 256 FCoE LUNs during a cable-pull test: the cable was pulled for a few seconds, and after each re-insert the FCoE LUNs were rediscovered. The SCSI prints for each LUN on the console caused the system lockup or NMI warning. I added debug code to the SCSI scan path and with that ruled out any locking or interrupt disabling there, before the console printk, as the cause. I noticed the CPU getting a hard lockup just inside printk, as in the attached trace below. This led me to write the attached test module, which does nothing but console prints; it triggers the issue and rules out the SCSI scan for sure.
 
A similar issue was investigated in RHEL 6.2/6.3, so I tried the same test module on RHEL 6.3. I could not trigger the issue there out of the box, but once I set the watchdog threshold to the same 10-second value used in FC17, the same issue occurred on RHEL 6.3 as well. I think the original issue on RHEL was not really fixed but was instead masked by increasing the watchdog threshold to 60 seconds. CPUs are getting stuck for several seconds during SCSI scans, and that could cause other problems such as losing the FCoE connection while scanning.

I noticed the NMI warning alongside IRQ rebalancing load, so I retried the test after stopping the irqbalance service and unloading all network drivers to reduce any network load factor. However, that didn't help.


NMI trace with CPU hard lockup:

Oct  4 07:29:47 liniop6 kernel: [  475.398521] WARNING: at kernel/watchdog.c:395 watchdog_overflow_callback+0xb3/0xe0()
Oct  4 07:29:47 liniop6 kernel: [  475.398521] Hardware name: PowerEdge R710
Oct  4 07:29:47 liniop6 kernel: [  475.398522] Watchdog detected hard LOCKUP on cpu 5
Oct  4 07:29:47 liniop6 kernel: [  475.398536] Modules linked in: bnx2fc cnic uio fcoe libfcoe libfc scsi_transport_fc scsi_tgt netconsole lockd sunrpc 8021q garp stp llc acpi_power_meter coretemp ixgbe kvm iTCO_wdt iTCO_vendor_support lpc_ich i7core_edac edac_core bnx2 mdio crc32c_intel serio_raw ghash_clmulni_intel dca ses joydev mfd_core dcdbas pcspkr enclosure microcode megaraid_sas [last unloaded: netconsole]
Oct  4 07:29:47 liniop6 kernel: [  475.398538] Pid: 1825, comm: kworker/u:4 Not tainted 3.6.0-rc4-upstream-linux-fcoe-queue-x86_64-gd5aa732+ #9
Oct  4 07:29:47 liniop6 kernel: [  475.398539] Call Trace:
Oct  4 07:29:47 liniop6 kernel: [  475.398543]  <NMI>  [<ffffffff8105d29f>] warn_slowpath_common+0x7f/0xc0
Oct  4 07:29:47 liniop6 kernel: [  475.398545]  [<ffffffff8105d396>] warn_slowpath_fmt+0x46/0x50
Oct  4 07:29:47 liniop6 kernel: [  475.398546]  [<ffffffff810ec3a3>] watchdog_overflow_callback+0xb3/0xe0
Oct  4 07:29:47 liniop6 kernel: [  475.398548]  [<ffffffff8112694d>] __perf_event_overflow+0x9d/0x240
Oct  4 07:29:47 liniop6 kernel: [  475.398551]  [<ffffffff810256d7>] ? x86_perf_event_set_period+0xd7/0x160
Oct  4 07:29:47 liniop6 kernel: [  475.398553]  [<ffffffff81127624>] perf_event_overflow+0x14/0x20
Oct  4 07:29:47 liniop6 kernel: [  475.398555]  [<ffffffff8102acb3>] intel_pmu_handle_irq+0x193/0x310
Oct  4 07:29:47 liniop6 kernel: [  475.398558]  [<ffffffff8165b32d>] perf_event_nmi_handler+0x1d/0x20
Oct  4 07:29:47 liniop6 kernel: [  475.398560]  [<ffffffff8165aa79>] nmi_handle.isra.0+0x69/0xa0
Oct  4 07:29:47 liniop6 kernel: [  475.398561]  [<ffffffff8165ab90>] do_nmi+0xe0/0x350
Oct  4 07:29:47 liniop6 kernel: [  475.398563]  [<ffffffff8165a080>] end_repeat_nmi+0x1e/0x2e
Oct  4 07:29:47 liniop6 kernel: [  475.398566]  [<ffffffff8135de49>] ? cfb_imageblit+0x4b9/0x4f0
Oct  4 07:29:47 liniop6 kernel: [  475.398567]  [<ffffffff8135de49>] ? cfb_imageblit+0x4b9/0x4f0
Oct  4 07:29:47 liniop6 kernel: [  475.398568]  [<ffffffff8135de49>] ? cfb_imageblit+0x4b9/0x4f0
Oct  4 07:29:47 liniop6 kernel: [  475.398572]  <<EOE>>  [<ffffffff815486d3>] ? dev_hard_start_xmit+0x3c3/0x700
Oct  4 07:29:47 liniop6 kernel: [  475.398575]  [<ffffffff81356f55>] bit_putcs+0x315/0x580
Oct  4 07:29:47 liniop6 kernel: [  475.398577]  [<ffffffff81357bc7>] ? soft_cursor+0x1d7/0x280
Oct  4 07:29:47 liniop6 kernel: [  475.398579]  [<ffffffff813512e3>] ? get_color.isra.14+0x43/0x160
Oct  4 07:29:47 liniop6 kernel: [  475.398580]  [<ffffffff81352f23>] fbcon_putcs+0x133/0x170
Oct  4 07:29:47 liniop6 kernel: [  475.398582]  [<ffffffff81356c40>] ? bit_clear_margins+0x190/0x190
Oct  4 07:29:47 liniop6 kernel: [  475.398584]  [<ffffffff8135324f>] fbcon_redraw.isra.21+0x18f/0x200
Oct  4 07:29:47 liniop6 kernel: [  475.398586]  [<ffffffff81354e5b>] fbcon_scroll+0x1db/0xd10
Oct  4 07:29:47 liniop6 kernel: [  475.398589]  [<ffffffff813c0e23>] scrup+0x113/0x120
Oct  4 07:29:47 liniop6 kernel: [  475.398590]  [<ffffffff813c0ed0>] lf+0x70/0x80
Oct  4 07:29:47 liniop6 kernel: [  475.398591]  [<ffffffff813c1bd4>] vt_console_print+0x284/0x3b0
Oct  4 07:29:47 liniop6 kernel: [  475.398593]  [<ffffffff8105e5f9>] call_console_drivers.constprop.14+0x99/0x120
Oct  4 07:29:47 liniop6 kernel: [  475.398595]  [<ffffffff8105f492>] console_unlock+0x412/0x450
Oct  4 07:29:47 liniop6 kernel: [  475.398596]  [<ffffffff8105f886>] vprintk_emit+0x286/0x520
Oct  4 07:29:47 liniop6 kernel: [  475.398599]  [<ffffffff8164e926>] printk_emit+0x31/0x33
Oct  4 07:29:47 liniop6 kernel: [  475.398602]  [<ffffffff813e7ae2>] __dev_printk+0x152/0x280
Oct  4 07:29:47 liniop6 kernel: [  475.398604]  [<ffffffff814175bc>] ? sd_revalidate_disk+0x30c/0x1f60
Oct  4 07:29:47 liniop6 kernel: [  475.398605]  [<ffffffff813e7f65>] dev_printk+0x45/0x50
Oct  4 07:29:47 liniop6 kernel: [  475.398607]  [<ffffffff810ebc04>] ? nmi_dbg_scsi+0x74/0xe0
Oct  4 07:29:47 liniop6 kernel: [  475.398608]  [<ffffffff8141942c>] sd_probe_async+0x21c/0x290
Oct  4 07:29:47 liniop6 kernel: [  475.398611]  [<ffffffff81089b98>] async_run_entry_fn+0x88/0x190
Oct  4 07:29:47 liniop6 kernel: [  475.398614]  [<ffffffff8107a94f>] process_one_work+0x13f/0x4d0
Oct  4 07:29:47 liniop6 kernel: [  475.398616]  [<ffffffff81089b10>] ? async_unregister_domain+0x80/0x80
Oct  4 07:29:47 liniop6 kernel: [  475.398618]  [<ffffffff8107c29e>] worker_thread+0x15e/0x440
Oct  4 07:29:47 liniop6 kernel: [  475.398619]  [<ffffffff8107c140>] ? manage_workers.isra.20+0x290/0x290
Oct  4 07:29:47 liniop6 kernel: [  475.398621]  [<ffffffff81081a83>] kthread+0x93/0xa0
Oct  4 07:29:47 liniop6 kernel: [  475.398623]  [<ffffffff81662a84>] kernel_thread_helper+0x4/0x10
Oct  4 07:29:47 liniop6 kernel: [  475.398625]  [<ffffffff810819f0>] ? __init_kthread_worker+0x40/0x40
Oct  4 07:29:47 liniop6 kernel: [  475.398626]  [<ffffffff81662a80>] ? gs_change+0x13/0x13

Comment 1 Josh Boyer 2012-10-17 15:26:05 UTC
3.3.4-5 is a very old kernel at this point, and the backtrace is from a custom-built upstream kernel.  Given you're clearly working with upstream source, did you report this bug to the upstream developers?

Comment 2 Vasu Dev 2012-10-17 20:27:30 UTC
I started looking into the NMI issue with 256 FCoE LUNs and tested all three of RHEL 6.3 GA, FC17, and an upstream kernel built from the FC17 config. FC17 and RHEL 6.3 lock up completely with no useful info before the lockup, while upstream only hits the NMI WARN attached above in the trace. The upstream trace can be used as a data point to root-cause the FC17 and RHEL 6.3 lockups. So far I've ruled out FCoE/SCSI/network, which suggests the real root cause and fix lie in the NMI or console driver code, so someone more familiar with those subsystems could be of quick help here and pursue this with any of these configs. Is there a Red Hat owner/expert for those areas?

Also, trying the attached test module is very quick, so Josh, can you give it a quick try on the latest FC17 kernel? It would give another data point in case the lockup is unique to my HW setup, though I tested two HW flavors with details attached to the bug.

Comment 3 Vasu Dev 2012-10-29 21:56:29 UTC
Any response to my comment #2?

Related RHEL 6.x BZ: https://bugzilla.redhat.com/show_bug.cgi?id=647041. I believe RHEL worked around this issue by increasing the watchdog threshold to 60; as I mentioned before, the lockup also occurs on RHEL 6.3 once the watchdog threshold is set to 10 as in FC17, so the real underlying issue has existed for quite some time.

Comment 4 Neil Horman 2012-10-31 16:21:30 UTC
Vasu, also, can you try this without the netconsole module? That's often a path that's responsible for lockups.

Comment 5 Vasu Dev 2012-11-01 17:20:18 UTC
(In reply to comment #4)
> Vasu, also, can you try this without the netconsole module? That's often a
> path that's responsible for lockups.

I also suspected that and therefore I had it disabled in my test. 

I started with FCoE on this issue and then isolated SCSI and networking in my debugging, as described in the bug description. The attached test module triggers the issue without any networking modules loaded, which rules out the entire networking subsystem, including netconsole.

Comment 6 Neil Horman 2012-11-01 20:31:26 UTC
Thank you Vasu, that's very helpful.  I'll use that to reproduce here shortly.

Comment 7 Neil Horman 2012-11-02 19:27:26 UTC
I've used the module to reproduce the problem.  Not sure what to do about it at this point though.  Thinking....

Comment 8 Neil Horman 2012-11-07 19:22:49 UTC
Hmm, something just occurred to me looking at this code.  The test case is running _much_ faster than I would expect it to, and it's not printing everything it should be.  Investigating...

Comment 9 Neil Horman 2012-11-07 20:30:24 UTC
OK, found the problem, and it's code that you simply can't run in the kernel.  I noted your pr_count variable is decremented in a while loop. If you create a process, renice it to be the highest-priority task in the system, never yield the CPU while in the kernel, and do nothing but issue printks, the kernel will do exactly that to the exclusion of all other tasks in the system.  And if you do so on every CPU, deadlocks are inevitable.  It might cause the same symptom as what you see in FCoE, but it's not really a fixable issue.  More specifically, the fix may simply be to ratelimit the printk with the __ratelimit function.  I presume you're seeing the sd_revalidate_disk printk when this problem occurs?
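For illustration, one common way to apply the __ratelimit() suggestion; the names here are hypothetical and this is not a patch from this bug:

#include <linux/ratelimit.h>
#include <linux/printk.h>

static DEFINE_RATELIMIT_STATE(sd_dbg_rs, DEFAULT_RATELIMIT_INTERVAL,
			      DEFAULT_RATELIMIT_BURST);

static void noisy_event(int cpu)
{
	/* only a small burst of these per interval actually reaches the console */
	if (__ratelimit(&sd_dbg_rs))
		pr_info("VDBG Console print flood from CPU:%d\n", cpu);
}

The printk_ratelimited() helper in linux/printk.h wraps the same mechanism for one-off call sites.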

Comment 10 Neil Horman 2012-11-09 20:19:46 UTC
Yeah, the more I look at this, the more I think it's not really a bug, at least not one that's worth fixing.  The test case here busies every CPU on the system with a high-priority thread that never yields the CPU, so we basically are guaranteed to lose responsiveness in user space immediately.  And flooding the console is something that is consistently handled upstream with ratelimiting (as flooding the console isn't useful).  Additionally, if this is a problem with constant printouts from sd_revalidate_disk, then we've either found a call that needs ratelimiting, or we've got debugging turned on and created a pessimal situation that will never be encountered on production systems (i.e. in this case we would need to reduce scsi_logging_level to reduce console stress).

Comment 11 Vasu Dev 2012-11-12 23:04:23 UTC
(In reply to comment #10)
> yeah, the more I look at this the more I think it's not really a bug, at
> least not one that's worth fixing.  The test case here busies every CPU on the
> system with a high-priority thread that never yields the CPU,

The test thread does yield the CPU:

1) It simply writes prints in thread context, so interrupts, soft IRQs, and other threads that become higher priority due to long delays will always preempt the test thread.

2) The test threads sleep for 100 seconds after each print cycle, which is a really long period to allow others to finish their work, and there is no other load with all networking disabled.


It's okay to be unresponsive for the time being, but for the above reasons I don't see any way the test thread itself could mask the timer interrupt or soft IRQs and thus result in an NMI or soft lockup; rather, these prints from the threads are probably indirectly hitting a console bug that causes the NMI or system lockup.

> so we
> basically are guaranteed to lose responsiveness in user space immediately.

That's fine; if you do insmod followed by rmmod after some delay, you would expect the system to come back, but most of the time it doesn't, and that is the issue.

> And flooding the console is something that is consistently handled upstream
> with ratelimiting (as flooding the console isn't useful).  Additionally if
> this is a problem with constant printouts from sd_revalidate_disk, then
> we've either found a call that needs ratelimiting, or we've got debugging
> turned on and created a pessimal situation which will never be encountered
> with production systems (i.e. in this case we would need to reduce
> scsi_logging_level to reduce console stress).

It's fine to work around this using ratelimiting, but a real root-cause fix would be better.
I call it a workaround because ratelimiting means reducing prints, and that masks the underlying issue, just as RHEL 6.3 masked it differently by increasing the watchdog threshold to 60 seconds, which is a really long time for the timer interrupt not to occur, followed by the NMI after 120 seconds of the timer interrupt being disabled. If you set the threshold to 10 seconds, then a production kernel such as the FC kernel hits the same issue, either with this test module or with an FCoE SCSI HBA discovering 256 LUNs on intermittent link up/down. I think it would occur with any HBA, since SCSI console prints are common.

Comment 12 Neil Horman 2012-11-13 19:38:40 UTC
>The test thread does yield the CPU:
Not in any meaningful way, it doesn't.

>1) It simply writes prints in thread context, so interrupts, soft IRQs, and other threads that become higher priority due to long delays will always preempt the test thread.
>
That's not yielding in any meaningful way.  Interrupts and softirqs will still run, yes, but they will quickly exhaust system resources, because while they create consumable requests (skbs get queued for reception and processing, data arrives on serial ports or input devices, timers schedule delayed work, etc.), nothing ever consumes them.  While tasks at the same or higher priority would get scheduled to interleave with your test module, that won't ever happen, because you've set the nice level of these tasks to -20, so while they're in the kernel they will never get preempted by any other user-space task; hence the user-space lockup, and if nothing in user space handles the data being generated by interrupts, we eventually deadlock.  Simply removing the renicing call in your test module (along with another minor fix, see below) prevents any deadlock from occurring.

>2) The test threads sleep for 100 seconds after each print cycle, which is a really long period to allow others to finish their work while there is no other load with all networking disabled.
Not quite; your original code has a bug.  You appear to have intended to have each kthread perform this sequence test_pr_count times:
1)printk(VDBG Console print flood from CPU:)
2) sleep(100 seconds)
3) printk(VDBG out of sleep CPU:%d)

But the while loop indicates you're going to print step 1 test_pr_count times before sleeping at all.  Since test_pr_count defaults to 1000000, you're going to print that first line 1000000 times before sleeping.  If you do that on each CPU, making them compete for the logbuf lock, that's going to take a lot of time, especially when you consider that one of those tasks needs to write everything to each console, which is more often than not a hardware device like a SOL or RAC card that will block when it's full of buffered data.  Couple this with the fact that each of these runs at a nice of -20, and no other user-space task will run at all until these CPUs actually allow anything to, and they simply don't.

If you modify your test case to simply remove the nice level and call schedule_timeout between the printks on each iteration, you can sleep for as little as 1 jiffy between each call and the problem doesn't occur.  Alternatively, if you don't schedule a printk on one or more CPUs, you don't get any lockup and don't even need to sleep.  Only if you execute on all CPUs and don't yield do you get any sort of deadlock.
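A sketch of the adjusted thread body this describes, assuming the same parameter names and includes as the illustrative module sketch in the description; this is an illustrative rework, not a patch from this bug:

static int flood_fn_fixed(void *data)
{
	int cpu = (long)data;

	/* no set_user_nice(current, -20) here */
	while (!kthread_should_stop()) {
		int n = test_pr_count;

		while (n-- && !kthread_should_stop()) {
			pr_info("VDBG Console print flood from CPU:%d\n", cpu);
			schedule_timeout_interruptible(1);	/* yield for one jiffy */
		}
		ssleep(test_pr_sleep);
	}
	return 0;
}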

>It's okay to be unresponsive for the time being, but for the above reasons I don't see any way the test thread itself could mask the timer interrupt or soft IRQs and thus result in an NMI or soft lockup; rather, these prints from the threads are probably indirectly hitting a console bug that causes the NMI or system lockup.
The NMI triggers for me just fine after it times out (as it does for you, since that's what generated the stack trace in the initial comment of this BZ), and irqs and softirqs do in fact run (that's how the soft-lockup notices are triggered), but eventually the tasks in the test module monopolize the CPU to a degree where that's irrelevant; just running irqs and softirqs won't prevent a system from deadlocking.

I think the same thing is happening in your production case: if you encounter an error on an FCoE connection, you wind up calling sd_revalidate_disk.  That code path does a fair amount of spin_lock_irqsave, which means that a good portion of the time these printks are actually being issued with irqs blocked.  And there can be up to 256 worker threads running at any one time.  So I think we're really just seeing another way that a set of tasks monopolizes the system to the exclusion of other important work.
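To illustrate the hazard being described here; this is not code from sd_revalidate_disk, and the names are hypothetical:

#include <linux/spinlock.h>
#include <linux/printk.h>

static DEFINE_SPINLOCK(demo_lock);

static void noisy_update(void)
{
	unsigned long flags;

	spin_lock_irqsave(&demo_lock, flags);
	/*
	 * The console write below happens synchronously with local irqs off;
	 * if the console is slow, this CPU cannot take the timer interrupt
	 * for that whole time, which is what the hard-lockup watchdog flags.
	 */
	pr_notice("state change while holding demo_lock\n");
	spin_unlock_irqrestore(&demo_lock, flags);
}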

We don't necessarily need to just ratelimit the printks; looking at this, the SCSI midlayer looks like it's doing a fair amount of redundant work, so eliminating some of that might be an alternate solution.  I'll try to put together a patch for that soon.

Comment 13 Neil Horman 2012-11-13 21:09:44 UTC
Hmm, upstream commit 083b804c4d3e1e3d0eace56bdbc0f674946d2847 might be useful here.  It seems the async code that runs in RHEL 6 has a horrible habit of creating more threads than we have CPUs to run them, and I'm betting that creates some real contention problems.  I'll see if I can get this commit backported; it moves the async code to using a workqueue, which should reduce CPU contention significantly.
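For illustration only, the general shape of that direction: deferred work goes onto a workqueue, whose concurrency is managed per CPU, instead of spawning a fresh kernel thread per item. The names below are hypothetical; this is not the upstream commit.

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/printk.h>
#include <linux/errno.h>

struct probe_work {
	struct work_struct work;
	int lun;			/* whatever the deferred job needs */
};

static struct workqueue_struct *probe_wq;

static void probe_fn(struct work_struct *work)
{
	struct probe_work *pw = container_of(work, struct probe_work, work);

	pr_info("probing lun %d\n", pw->lun);
	kfree(pw);
}

static int defer_probe(int lun)
{
	struct probe_work *pw = kzalloc(sizeof(*pw), GFP_KERNEL);

	if (!pw)
		return -ENOMEM;
	pw->lun = lun;
	INIT_WORK(&pw->work, probe_fn);
	queue_work(probe_wq, &pw->work);	/* no new kthread spawned per request */
	return 0;
}

/* somewhere during init: probe_wq = alloc_workqueue("probe", WQ_UNBOUND, 0); */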

Comment 14 Neil Horman 2012-11-14 20:51:59 UTC
Created attachment 645175 [details]
patch to reduce the async thread count to be sane

So, looking at it, pulling in the above commit implies a complete update of the workqueue code, and that's way too much, at least to do without taking lots of care. I think the RHEL 6-only async thread count limit that I noted earlier would be good.

Note, this fix is for RHEL 6, not Fedora.  But since the description of this problem mixes a problem on RHEL 6 with the reproducer on Fedora, I want to be clear that this doesn't apply to the reproducer module; I think it will fix the actual FCoE problem on RHEL 6, though.  Please build a test kernel with this patch and confirm that it solves the original problem.  Thank you.

Comment 15 Vasu Dev 2012-11-15 18:38:09 UTC
(In reply to comment #14)
> Created attachment 645175 [details]
> patch to reduce the async thread count to be sane
>
> So, looking at it, pulling in the above commit implies a complete update of
> the workqueue code, and that's way too much, at least to do without taking
> lots of care. I think the RHEL 6-only async thread count limit that I noted
> earlier would be good.
>
> Note, this fix is for RHEL 6, not Fedora.  But since the description of this
> problem mixes a problem on RHEL 6 with the reproducer on Fedora, I want to be
> clear that this doesn't apply to the reproducer module; I think it will fix
> the actual FCoE problem on RHEL 6, though.  Please build a test kernel with
> this patch and confirm that it solves the original problem.

Neil, which original problem should I test with this RHEL patch?  This BZ is for a system lockup when the console is flooded using the test module, without any networking modules loaded and with the console at run level 3.

I'm not sure I can address all recent notes inline, so instead I'll try to address the key points from your findings so far.

On your point that the per-CPU test kernel thread at nice -20 is the bug in the test code because it would starve deferred skb-handling tasks: I had networking completely disabled on the system, so there was no such task to starve. The goal of the test was to hammer just the console with no other load and expect the system not to lock up; that means if the test module is loaded for some finite time, the system should come back alive once the test module is done, but that is not the case. If there is any specific thread you think might starve here, then maybe testing without nice -20 would help.

On your point that possibly SCSI code is I got the point of just stressing the console test after making sure

Comment 16 Vasu Dev 2012-11-15 18:58:55 UTC
Oops, the BZ got updated in the middle of my notes; I'll continue from comment #15.

On your point that possibly the SCSI code disabling IRQs while issuing prints could lead a thread to monopolize the CPU: I verified that before and found that the NMI occurred from plain printks issued while interrupts were not disabled. I stored all the info in an array and then triggered its printing from the NMI watchdog when the missed timer interrupt caused the NMI; I used an array to have minimal effect on the runtime flow. So it is unlikely that the SCSI code is causing the NMI and possibly the subsequent lockup; I think something down inside the printk or console driver implementation, possibly waiting on a lock with interrupts disabled, could be the cause here. We can always work around this by limiting prints in the code or simply using "dmesg -n 1" or loglevel=1, but it would be better if the console driver could handle the load without causing an NMI. Below is the trace I captured on the NMI, indicating that neither IRQs nor bottom halves were disabled before the SCSI printks when the NMI from the console printk occurred (a sketch of the debug approach follows the trace):

[ 2961.406214] VDBG scsi idx: 23 : func: sd_revalidate_disk line 2562: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406216] VDBG scsi idx: 22 : func: sd_revalidate_disk line 2560: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406217] VDBG scsi idx: 21 : func: sd_revalidate_disk line 2557: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406218] VDBG scsi idx: 20 : func: sd_revalidate_disk line 2555: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406220] VDBG scsi idx: 19 : func: sd_revalidate_disk line 2541: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406221] VDBG scsi idx: 18 : func: sd_revalidate_disk line 2539: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406222] VDBG scsi idx: 17 : func: sd_revalidate_disk line 2537: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612436 and delta:0
[ 2961.406224] VDBG scsi idx: 16 : func: sd_revalidate_disk line 2535: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d431 pip: ffffffff8108b288 CPU: 12 TS:4297612435 and delta:1
[ 2961.406225] VDBG scsi idx: 15 : func: sd_revalidate_disk line 2562: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:168
[ 2961.406226] VDBG scsi idx: 14 : func: sd_revalidate_disk line 2560: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406227] VDBG scsi idx: 13 : func: sd_revalidate_disk line 2557: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406229] VDBG scsi idx: 12 : func: sd_revalidate_disk line 2555: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406230] VDBG scsi idx: 11 : func: sd_revalidate_disk line 2541: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406231] VDBG scsi idx: 10 : func: sd_revalidate_disk line 2539: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406233] VDBG scsi idx: 9 : func: sd_revalidate_disk line 2537: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406234] VDBG scsi idx: 8 : func: sd_revalidate_disk line 2535: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406236] VDBG scsi idx: 7 : func: sd_revalidate_disk line 2494: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406237] VDBG scsi idx: 6 : func: sd_revalidate_disk line 2486: irq_dis: 0 bh_dis: 0  cip: ffffffff812f3c45 pip: ffffffff811c56b6 CPU: 12 TS:4297612267 and delta:0
[ 2961.406238] VDBG scsi idx: 5 : func: sd_revalidate_disk line 2533: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d314 pip: ffffffff8108b288 CPU: 12 TS:4297612266 and delta:1
[ 2961.406240] VDBG scsi idx: 4 : func: sd_revalidate_disk line 2510: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d314 pip: ffffffff8108b288 CPU: 12 TS:4297612266 and delta:0
[ 2961.406241] VDBG scsi idx: 3 : func: sd_revalidate_disk line 2484: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d314 pip: ffffffff8108b288 CPU: 12 TS:4294672865 and delta:2939401
[ 2961.406242] VDBG scsi idx: 2 : func: sd_revalidate_disk line 2466: irq_dis: 0 bh_dis: 0  cip: ffffffff8141d314 pip: ffffffff8108b288 CPU: 12 TS:4294672865 and delta:0
[ 2961.406243] VDBG scsi idx: 1 : func: sd_probe_async line 2672: irq_dis: 0 bh_dis: 0  cip: ffffffff8108b288 pip: ffffffff8107bfcf CPU: 12 TS:4294672813 and delta:52
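For reference, a sketch of the kind of low-overhead instrumentation described above: a fixed-size ring of records filled at each instrumented point and only dumped when the watchdog fires. All names are hypothetical; this is not the actual debug patch used for the trace above.

#include <linux/atomic.h>
#include <linux/irqflags.h>
#include <linux/hardirq.h>
#include <linux/smp.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

#define NMI_DBG_ENTRIES 32

struct nmi_dbg_rec {
	const char *func;
	int line;
	int irq_dis;
	int bh_dis;
	int cpu;
	unsigned long ts;
};

static struct nmi_dbg_rec nmi_dbg_ring[NMI_DBG_ENTRIES];
static atomic_t nmi_dbg_idx = ATOMIC_INIT(0);

/* cheap probe dropped into the code path under suspicion */
static inline void nmi_dbg_mark(const char *func, int line)
{
	int i = atomic_inc_return(&nmi_dbg_idx) % NMI_DBG_ENTRIES;

	nmi_dbg_ring[i] = (struct nmi_dbg_rec) {
		.func = func, .line = line,
		.irq_dis = irqs_disabled(),
		.bh_dis = in_softirq(),
		.cpu = raw_smp_processor_id(),
		.ts = jiffies,
	};
}

/* called from the watchdog path once a hard lockup is detected */
static void nmi_dbg_dump(void)
{
	int i;

	for (i = 0; i < NMI_DBG_ENTRIES; i++)
		pr_err("VDBG idx:%d func:%s line:%d irq_dis:%d bh_dis:%d CPU:%d TS:%lu\n",
		       i, nmi_dbg_ring[i].func ?: "?", nmi_dbg_ring[i].line,
		       nmi_dbg_ring[i].irq_dis, nmi_dbg_ring[i].bh_dis,
		       nmi_dbg_ring[i].cpu, nmi_dbg_ring[i].ts);
}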

Comment 17 Neil Horman 2012-11-16 14:49:11 UTC
Vasu, in your initial description of this problem you said you noted this console lockup while adding 256 LUNs over FCoE during a cable-pull test (ostensibly under RHEL 6.3).  This patch is meant to address that.  Please build a kernel with it and run it against that test.

Comment 18 Vasu Dev 2012-11-16 20:21:43 UTC
I started with the reported 256-FCoE-LUN cable-pull test causing a lockup on an upstream kernel using the FC17 config; I didn't see a complete lockup, only the NMI WARN, and that was due to a 20-second temporary lockup. I pursued that further using the test module, and with it I could see FC17 and RHEL 6.3 lock up completely; the test module stretches things further and leaves the system locked even with networking completely disabled at run level 3. I think there is an underlying issue in the console or NMI code, and I wish we had help from an expert in those areas here. As for FCoE, RHEL 6.3 is fine since its watchdog threshold is 60 and the RHEL BZ is already verified. The upstream and FC17 kernels can avoid the NMI WARN by reducing console prints with "dmesg -n 1" or loglevel=1. We can work around it, but fixing the root cause in the console or NMI code, using the test module, would be the better fix.

The test module doesn't use any SCSI code, so testing the SCSI fix patch won't help.

Comment 19 Neil Horman 2012-11-16 20:45:18 UTC
I'm perfectly well aware of what you started with and what the test module does.  What I'm asserting is that:

1) The test module reproduces a problem that would never exist in production environments.  If you write code that prevents the system from doing enough real work, yes, it will stop working properly; the solution is "don't do that".  While it presents similar symptoms to what you initially encountered (and in fact does so in a similar way), the solution in both cases is: don't monopolize the CPUs on the system like that.

2) The initial problem you encountered appears to have happened because the code which runs sd_revalidate_disk does so from a library that, in RHEL 6, spawns an almost unlimited number of threads (due to an atomic_dec/inc race) that can monopolize the CPU while running in kernel space.  I've found an upstream commit that reworks that code, moving it to a workqueue that limits thread execution to scale with the number of CPUs on a given system.  RHEL 6 can't take that change because it's too invasive, so I've found a way to work around it and approximate what that change does.

So, if you're still seeing the original problem that you reported, please try the patch.  If you're not seeing it, then we should probably close this BZ.

Comment 20 Vasu Dev 2012-11-30 19:57:52 UTC
The original problem was with an upstream kernel using the FC17 config; that now shows only the NMI WARN_ON, and it doesn't happen with loglevel=1 on the kernel cmdline.

RHEL prior to 6.4 is tested without any issue with FCoE in this case. I don't know of any production FCoE test cases; I've seen the issue on RHEL only with the test module, and that was for debugging on RHEL, which, as you said, is not practical. RHEL might have worked with a production test case, and its higher watchdog threshold of 60 might have helped there. I think the identified upstream commit backport would be useful if such an issue is found in RHEL first. So, bottom line: let's limit this BZ to upstream, where the WARN_ON doesn't happen with loglevel=1, and we cannot fix the test-module issue here, as we discussed before. So I'm closing this BZ.

