Bug 537697

Summary: kernel 2.6.32-0.33... gets "inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage"
Product: [Fedora] Fedora Reporter: Michal Jaegermann <michal>
Component: kernelAssignee: Kernel Maintainer List <kernel-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium Docs Contact:
Priority: low    
Version: 19CC: gansalmon, itamar, jforbes, kernel-maint, lemenkov, orion
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2013-09-18 18:43:37 EDT Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Attachments:
Description Flags
dmesg output for kernel-2.6.32-0.33.rc5.git1.fc13.x86_64
none
dmesg for 2.6.32-0.48.rc7.git1.fc13.x86_64 kernel
none
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by work_for_cpu/210")
none
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by swapper/0")
none
trace from kernel-2.6.36-0.21.rc4.git1.fc15.x86_64 ("no locks held by udevd/85" this time)
none
somewhat different trace from 2.6.36-0.26.rc5.git4.fc15.x86_64 ("no locks held by swapper/0")
none
dmesg for 2.6.39-0.rc3.git2.0.fc16.x86_64 ("1 lock held by scsi_eh_1/196")
none
locking error as showing up in 3.0-0.rc7.git0.1.fc16 none

Description Michal Jaegermann 2009-11-15 15:27:56 EST
Created attachment 369619 [details]
dmesg output for kernel-2.6.32-0.33.rc5.git1.fc13.x86_64

Description of problem:

With a kernel from rawhide I see

[ INFO: inconsistent lock state ]
2.6.32-0.33.rc5.git1.fc13.x86_64 #1
---------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
scsi_eh_1/192 [HC1[1]:SC0[0]:HE0:SE1] takes:
 (&host->lock){?.+...}, at: [<ffffffff813e71d3>] ata_sff_interrupt+0x36/0xbe
{HARDIRQ-ON-W} state was registered at:
  [<ffffffff810ae1d0>] __lock_acquire+0x370/0xd5d
  [<ffffffff810aecb0>] lock_acquire+0xf3/0x12d
  [<ffffffff81564d8f>] _spin_lock+0x45/0x8e
  [<ffffffffa004cc23>] pdc_sata_hardreset+0x70/0xf4 [sata_promise]
  [<ffffffff813df8f5>] ata_do_reset+0x6c/0x8b
  [<ffffffff813e1195>] ata_eh_reset+0x511/0xcb9
  [<ffffffff813e2af7>] ata_eh_recover+0x2d9/0xbd7
  [<ffffffff813e3665>] ata_do_eh+0x5a/0xbb
  [<ffffffff813e3738>] ata_std_error_handler+0x72/0x91
  [<ffffffffa004c71e>] pdc_error_handler+0x33/0x4e [sata_promise]
  [<ffffffff813e40d1>] ata_scsi_error+0x337/0x699
  [<ffffffff813bb9f3>] scsi_error_handler+0x104/0x5bd
  [<ffffffff8109656d>] kthread+0x9a/0xa2
  [<ffffffff810130ea>] child_rip+0xa/0x20

That is followed by bug 520186 which is still there.  Full dmesg attached.

Version-Release number of selected component (if applicable):
kernel-2.6.32-0.33.rc5.git1.fc13.x86_64

How reproducible:
on every boot so far
Comment 1 Bug Zapper 2009-11-16 10:33:56 EST
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 2 Michal Jaegermann 2009-11-18 17:35:22 EST
An update to 2.6.32-0.48.rc7.git1.fc13.x86_64 replaced
scsi_eh_1/192 ... at ... ata_sff_interrupt+0x36/0xbe with
swapper/0 ... at ... pdc_interrupt+0x3e/0x38d [sata_promise]
Otherwise this looks pretty similar although "other info that might help us debug this" not really the same.  A full dmesg output attached.
Comment 3 Michal Jaegermann 2009-11-18 17:36:43 EST
Created attachment 370217 [details]
dmesg for 2.6.32-0.48.rc7.git1.fc13.x86_64 kernel
Comment 4 Michal Jaegermann 2009-12-13 17:43:53 EST
Created attachment 378109 [details]
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by work_for_cpu/210")

Once I got with the current rawhide kernel "no locks held by work_for_cpu/210" and a backtrace like attached.  More typical this will be "no locks held by swapper/0" and something like from the next attachment
Comment 5 Michal Jaegermann 2009-12-13 17:45:26 EST
Created attachment 378110 [details]
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by swapper/0")
Comment 6 Michal Jaegermann 2010-03-08 19:26:33 EST
This complaint seems to be gone at last in 2.6.34-...fc14.x86_64 kernels; or at least is not displayed.  The last time I tried this was not the case with 2.6.33-...  kernels as currently used for fc13.
Comment 7 Bug Zapper 2010-03-15 09:01:43 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping
Comment 8 Orion Poplawski 2010-03-18 12:47:16 EDT
[ INFO: inconsistent lock state ]
2.6.33-1.fc13.i686.PAE #1
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
scsi_eh_5/262 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&host->lock)->rlock){?.-...}, at: [<f7e89845>] pdc_sata_hardreset+0x82/0xdc [sata_promise]
{IN-HARDIRQ-W} state was registered at:
  [<c046a2ca>] __lock_acquire+0x266/0xb89
  [<c046ac80>] lock_acquire+0x93/0xb1
  [<c07c56d1>] _raw_spin_lock_irqsave+0x37/0x6a
  [<c06a3d16>] ata_sff_interrupt+0x19/0xad
  [<c048fdad>] handle_IRQ_event+0x4b/0xf6
  [<c0491664>] handle_edge_irq+0xb1/0xf4
  [<c040a79c>] handle_irq+0x40/0x4c
  [<c040a00b>] do_IRQ+0x46/0x9f
  [<c0409175>] common_interrupt+0x35/0x3c
  [<c040ee82>] default_idle+0x44/0x6e
  [<c0407dbe>] cpu_idle+0x9b/0xb5
  [<c07c0752>] start_secondary+0x204/0x242
irq event stamp: 35
hardirqs last  enabled at (35): [<c07c5d44>] _raw_spin_unlock_irqrestore+0x41/0x4d
hardirqs last disabled at (34): [<c07c56ba>] _raw_spin_lock_irqsave+0x20/0x6a
softirqs last  enabled at (0): [<c0440a86>] copy_process+0x319/0x10b1
softirqs last disabled at (0): [<(null)>] (null)

other info that might help us debug this:
no locks held by scsi_eh_5/262.
Comment 9 Michal Jaegermann 2010-09-15 16:26:50 EDT
Created attachment 447565 [details]
trace from kernel-2.6.36-0.21.rc4.git1.fc15.x86_64 ("no locks held by udevd/85" this time)

I did not see that for quite a while but with 2.6.36-0.21.rc4.git1.fc15.x86_64 it looks like that it returned (and it is quite easy to see that):

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
udevd/85 [HC1[1]:SC0[0]:HE0:SE1] takes:
 (&(&host->lock)->rlock){?.+...}, at: [<ffffffffa0047cf1>] pdc_interrupt+0x40/0x38e [sata_promise]

udevd is from udev-161-2.fc15 and was installed on 2010/09/03 but complaints started to show up only with a new kernel.

More data in an attachment.
Comment 10 Michal Jaegermann 2010-09-27 13:55:04 EDT
Created attachment 449985 [details]
somewhat different trace from 2.6.36-0.26.rc5.git4.fc15.x86_64 ("no locks held by swapper/0")
Comment 11 Michal Jaegermann 2010-12-12 17:09:58 EST
This issue was apparently absent in recent rawhide kernels.  In particular I did not notice that with 2.6.36-5.fc15.x86_64 or 2.6.36.2-12.rc1.fc15.x86_64.
After updating to 2.6.37-0.rc5.git2.1.fc15.x86_64 I see on every boot:

[    4.144783] =================================
[    4.145006] [ INFO: inconsistent lock state ]
[    4.145006] 2.6.37-0.rc5.git2.1.fc15.x86_64 #1
[    4.145006] ---------------------------------
[    4.145006] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[    4.145006] work_for_cpu/226 [HC1[1]:SC0[0]:HE0:SE1] takes:
[    4.145006]  (&(&host->lock)->rlock){?.+...}, at: [<ffffffffa0047cf4>] pdc_interrupt+0x40/0x38e [sata_promise]
.....
[    4.145006] no locks held by work_for_cpu/226.
.....
(and so on).

Anybody interested in a full trace?  Does not seem to be much different from what was recorded before.
Comment 12 Michal Jaegermann 2011-04-20 21:25:49 EDT
This was not showing up with 2.6.38 kernels.  With 2.6.39 (and microcode.ko module renamed to allow to boot at all) I see again:

[    3.740008] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[    3.740008] scsi_eh_1/196 [HC1[1]:SC0[0]:HE0:SE1] takes:
[    3.740008]  (&(&host->lock)->rlock){?.+...}, at: [<ffffffff81354afd>] __ata_sff_interrupt+0x2b/0x17f
[    3.740008] {HARDIRQ-ON-W} state was registered at:
[    3.740008]   [<ffffffff81085f37>] __lock_acquire+0x337/0xc81
[    3.740008]   [<ffffffff81086d11>] lock_acquire+0xd0/0xfb
[    3.740008]   [<ffffffff814bc710>] _raw_spin_lock+0x36/0x6a
[    3.740008]   [<ffffffffa000bb04>] pdc_sata_hardreset+0x5e/0xd5 [sata_promise]
[    3.740008]   [<ffffffff8134e5fd>] ata_do_reset+0x59/0x62
[    3.740008]   [<ffffffff81351400>] ata_eh_reset+0x4e9/0xc49
[    3.740008]   [<ffffffff81351f3d>] ata_eh_recover+0x2ef/0x10d1
[    3.740008]   [<ffffffff8135342e>] ata_do_eh+0x4b/0x98
[    3.740008]   [<ffffffff81355469>] ata_sff_error_handler+0xe3/0xef
[    3.740008]   [<ffffffffa000baa2>] pdc_error_handler+0x21/0x25 [sata_promise]
[    3.740008]   [<ffffffff8135300e>] ata_scsi_port_error_handler+0x244/0x553
[    3.740008]   [<ffffffff813533b8>] ata_scsi_error+0x9b/0xc6
[    3.740008]   [<ffffffff8132f8b4>] scsi_error_handler+0x114/0x5a0
[    3.740008]   [<ffffffff81072d55>] kthread+0xa8/0xb0
[    3.740008]   [<ffffffff814c4ca4>] kernel_thread_helper+0x4/0x10

Full dmesg attached.
Comment 13 Michal Jaegermann 2011-04-20 21:31:48 EDT
Created attachment 493676 [details]
dmesg for 2.6.39-0.rc3.git2.0.fc16.x86_64 ("1 lock held by scsi_eh_1/196")

This time there is:

#0:  (&host->eh_mutex){+.+...}, at: [<ffffffff8134ef93>] ata_eh_acquire+0x20/0x6e
Comment 14 Michal Jaegermann 2011-07-16 13:36:08 EDT
Created attachment 513491 [details]
locking error as showing up in 3.0-0.rc7.git0.1.fc16

This error is clearly inherited also by 3.0-... kernels.  This time an error message is somewhat more scary:

[    5.093008]  Possible unsafe locking scenario:
[    5.093008] 
[    5.093008]        CPU0
[    5.093008]        ----
[    5.093008]   lock(&(&host->lock)->rlock);
[    5.093008]   <Interrupt>
[    5.093008]     lock(&(&host->lock)->rlock);
[    5.093008] 
[    5.093008]  *** DEADLOCK ***
[    5.093008] 
[    5.093008] 1 lock held by udevd/145:

A remainder of a problem report is attached.
Comment 15 Peter Lemenkov 2011-09-10 14:44:23 EDT
Me too was just hit by this.
Comment 16 Fedora End Of Life 2013-04-03 14:42:17 EDT
This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19
Comment 17 Justin M. Forbes 2013-04-05 12:49:20 EDT
Is this still a problem with 3.9 based F19 kernels?
Comment 18 Michal Jaegermann 2013-04-05 13:13:52 EDT
(In reply to comment #17)
> Is this still a problem with 3.9 based F19 kernels?

No.  I do not remember when I have seen that the last time but it was quite a while.
Comment 19 Josh Boyer 2013-09-18 16:53:26 EDT
*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19.  Please test this kernel update and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.
Comment 20 Orion Poplawski 2013-09-18 18:43:37 EDT
Seems to be fixed a while ago.