Bug 537697

Summary:

kernel 2.6.32-0.33... gets "inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage"

Product:

[Fedora] Fedora

Reporter:

Michal Jaegermann <michal>

Component:

kernel

Assignee:

Kernel Maintainer List <kernel-maint>

Status:

CLOSED CURRENTRELEASE

QA Contact:

Fedora Extras Quality Assurance <extras-qa>

Severity:

medium

Docs Contact:

Priority:

low

Version:

CC:

gansalmon, itamar, jforbes, kernel-maint, lemenkov, orion

Target Milestone:

---

Target Release:

---

Hardware:

All

OS:

Linux

Whiteboard:

Fixed In Version:

Doc Type:

Bug Fix

Doc Text:

Story Points:

---

Clone Of:

Environment:

Last Closed:

2013-09-18 22:43:37 UTC

Type:

---

Regression:

---

Mount Type:

---

Documentation:

---

CRM:

Verified Versions:

Category:

---

oVirt Team:

---

RHEL 7.3 requirements from Atomic Host:

Cloudforms Team:

---

Target Upstream Version:

Embargoed:

Attachments:

Description	Flags
dmesg output for kernel-2.6.32-0.33.rc5.git1.fc13.x86_64	none
dmesg for 2.6.32-0.48.rc7.git1.fc13.x86_64 kernel	none
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by work_for_cpu/210")	none
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by swapper/0")	none
trace from kernel-2.6.36-0.21.rc4.git1.fc15.x86_64 ("no locks held by udevd/85" this time)	none
somewhat different trace from 2.6.36-0.26.rc5.git4.fc15.x86_64 ("no locks held by swapper/0")	none
dmesg for 2.6.39-0.rc3.git2.0.fc16.x86_64 ("1 lock held by scsi_eh_1/196")	none
locking error as showing up in 3.0-0.rc7.git0.1.fc16	none

Description Michal Jaegermann 2009-11-15 20:27:56 UTC

Created attachment 369619 [details]
dmesg output for kernel-2.6.32-0.33.rc5.git1.fc13.x86_64

Description of problem:

With a kernel from rawhide I see

[ INFO: inconsistent lock state ]
2.6.32-0.33.rc5.git1.fc13.x86_64 #1
---------------------------------
inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
scsi_eh_1/192 [HC1[1]:SC0[0]:HE0:SE1] takes:
 (&host->lock){?.+...}, at: [<ffffffff813e71d3>] ata_sff_interrupt+0x36/0xbe
{HARDIRQ-ON-W} state was registered at:
  [<ffffffff810ae1d0>] __lock_acquire+0x370/0xd5d
  [<ffffffff810aecb0>] lock_acquire+0xf3/0x12d
  [<ffffffff81564d8f>] _spin_lock+0x45/0x8e
  [<ffffffffa004cc23>] pdc_sata_hardreset+0x70/0xf4 [sata_promise]
  [<ffffffff813df8f5>] ata_do_reset+0x6c/0x8b
  [<ffffffff813e1195>] ata_eh_reset+0x511/0xcb9
  [<ffffffff813e2af7>] ata_eh_recover+0x2d9/0xbd7
  [<ffffffff813e3665>] ata_do_eh+0x5a/0xbb
  [<ffffffff813e3738>] ata_std_error_handler+0x72/0x91
  [<ffffffffa004c71e>] pdc_error_handler+0x33/0x4e [sata_promise]
  [<ffffffff813e40d1>] ata_scsi_error+0x337/0x699
  [<ffffffff813bb9f3>] scsi_error_handler+0x104/0x5bd
  [<ffffffff8109656d>] kthread+0x9a/0xa2
  [<ffffffff810130ea>] child_rip+0xa/0x20

That is followed by bug 520186 which is still there.  Full dmesg attached.

Version-Release number of selected component (if applicable):
kernel-2.6.32-0.33.rc5.git1.fc13.x86_64

How reproducible:
on every boot so far

Comment 1 Bug Zapper 2009-11-16 15:33:56 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 2 Michal Jaegermann 2009-11-18 22:35:22 UTC

An update to 2.6.32-0.48.rc7.git1.fc13.x86_64 replaced
scsi_eh_1/192 ... at ... ata_sff_interrupt+0x36/0xbe with
swapper/0 ... at ... pdc_interrupt+0x3e/0x38d [sata_promise]
Otherwise this looks pretty similar although "other info that might help us debug this" not really the same.  A full dmesg output attached.

Comment 3 Michal Jaegermann 2009-11-18 22:36:43 UTC

Created attachment 370217 [details]
dmesg for 2.6.32-0.48.rc7.git1.fc13.x86_64 kernel

Comment 4 Michal Jaegermann 2009-12-13 22:43:53 UTC

Created attachment 378109 [details]
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by work_for_cpu/210")

Once I got with the current rawhide kernel "no locks held by work_for_cpu/210" and a backtrace like attached.  More typical this will be "no locks held by swapper/0" and something like from the next attachment

Comment 5 Michal Jaegermann 2009-12-13 22:45:26 UTC

Created attachment 378110 [details]
additional information from kernel 2.6.32-7.fc13.x86_64 ("no locks held by swapper/0")

Comment 6 Michal Jaegermann 2010-03-09 00:26:33 UTC

This complaint seems to be gone at last in 2.6.34-...fc14.x86_64 kernels; or at least is not displayed.  The last time I tried this was not the case with 2.6.33-...  kernels as currently used for fc13.

Comment 7 Bug Zapper 2010-03-15 13:01:43 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 13 development cycle.
Changing version to '13'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 8 Orion Poplawski 2010-03-18 16:47:16 UTC

[ INFO: inconsistent lock state ]
2.6.33-1.fc13.i686.PAE #1
---------------------------------
inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
scsi_eh_5/262 [HC0[0]:SC0[0]:HE1:SE1] takes:
 (&(&host->lock)->rlock){?.-...}, at: [<f7e89845>] pdc_sata_hardreset+0x82/0xdc [sata_promise]
{IN-HARDIRQ-W} state was registered at:
  [<c046a2ca>] __lock_acquire+0x266/0xb89
  [<c046ac80>] lock_acquire+0x93/0xb1
  [<c07c56d1>] _raw_spin_lock_irqsave+0x37/0x6a
  [<c06a3d16>] ata_sff_interrupt+0x19/0xad
  [<c048fdad>] handle_IRQ_event+0x4b/0xf6
  [<c0491664>] handle_edge_irq+0xb1/0xf4
  [<c040a79c>] handle_irq+0x40/0x4c
  [<c040a00b>] do_IRQ+0x46/0x9f
  [<c0409175>] common_interrupt+0x35/0x3c
  [<c040ee82>] default_idle+0x44/0x6e
  [<c0407dbe>] cpu_idle+0x9b/0xb5
  [<c07c0752>] start_secondary+0x204/0x242
irq event stamp: 35
hardirqs last  enabled at (35): [<c07c5d44>] _raw_spin_unlock_irqrestore+0x41/0x4d
hardirqs last disabled at (34): [<c07c56ba>] _raw_spin_lock_irqsave+0x20/0x6a
softirqs last  enabled at (0): [<c0440a86>] copy_process+0x319/0x10b1
softirqs last disabled at (0): [<(null)>] (null)

other info that might help us debug this:
no locks held by scsi_eh_5/262.

Comment 9 Michal Jaegermann 2010-09-15 20:26:50 UTC

Created attachment 447565 [details]
trace from kernel-2.6.36-0.21.rc4.git1.fc15.x86_64 ("no locks held by udevd/85" this time)

I did not see that for quite a while but with 2.6.36-0.21.rc4.git1.fc15.x86_64 it looks like that it returned (and it is quite easy to see that):

inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
udevd/85 [HC1[1]:SC0[0]:HE0:SE1] takes:
 (&(&host->lock)->rlock){?.+...}, at: [<ffffffffa0047cf1>] pdc_interrupt+0x40/0x38e [sata_promise]

udevd is from udev-161-2.fc15 and was installed on 2010/09/03 but complaints started to show up only with a new kernel.

More data in an attachment.

Comment 10 Michal Jaegermann 2010-09-27 17:55:04 UTC

Created attachment 449985 [details]
somewhat different trace from 2.6.36-0.26.rc5.git4.fc15.x86_64 ("no locks held by swapper/0")

Comment 11 Michal Jaegermann 2010-12-12 22:09:58 UTC

This issue was apparently absent in recent rawhide kernels.  In particular I did not notice that with 2.6.36-5.fc15.x86_64 or 2.6.36.2-12.rc1.fc15.x86_64.
After updating to 2.6.37-0.rc5.git2.1.fc15.x86_64 I see on every boot:

[    4.144783] =================================
[    4.145006] [ INFO: inconsistent lock state ]
[    4.145006] 2.6.37-0.rc5.git2.1.fc15.x86_64 #1
[    4.145006] ---------------------------------
[    4.145006] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[    4.145006] work_for_cpu/226 [HC1[1]:SC0[0]:HE0:SE1] takes:
[    4.145006]  (&(&host->lock)->rlock){?.+...}, at: [<ffffffffa0047cf4>] pdc_interrupt+0x40/0x38e [sata_promise]
.....
[    4.145006] no locks held by work_for_cpu/226.
.....
(and so on).

Anybody interested in a full trace?  Does not seem to be much different from what was recorded before.

Comment 12 Michal Jaegermann 2011-04-21 01:25:49 UTC

This was not showing up with 2.6.38 kernels.  With 2.6.39 (and microcode.ko module renamed to allow to boot at all) I see again:

[    3.740008] inconsistent {HARDIRQ-ON-W} -> {IN-HARDIRQ-W} usage.
[    3.740008] scsi_eh_1/196 [HC1[1]:SC0[0]:HE0:SE1] takes:
[    3.740008]  (&(&host->lock)->rlock){?.+...}, at: [<ffffffff81354afd>] __ata_sff_interrupt+0x2b/0x17f
[    3.740008] {HARDIRQ-ON-W} state was registered at:
[    3.740008]   [<ffffffff81085f37>] __lock_acquire+0x337/0xc81
[    3.740008]   [<ffffffff81086d11>] lock_acquire+0xd0/0xfb
[    3.740008]   [<ffffffff814bc710>] _raw_spin_lock+0x36/0x6a
[    3.740008]   [<ffffffffa000bb04>] pdc_sata_hardreset+0x5e/0xd5 [sata_promise]
[    3.740008]   [<ffffffff8134e5fd>] ata_do_reset+0x59/0x62
[    3.740008]   [<ffffffff81351400>] ata_eh_reset+0x4e9/0xc49
[    3.740008]   [<ffffffff81351f3d>] ata_eh_recover+0x2ef/0x10d1
[    3.740008]   [<ffffffff8135342e>] ata_do_eh+0x4b/0x98
[    3.740008]   [<ffffffff81355469>] ata_sff_error_handler+0xe3/0xef
[    3.740008]   [<ffffffffa000baa2>] pdc_error_handler+0x21/0x25 [sata_promise]
[    3.740008]   [<ffffffff8135300e>] ata_scsi_port_error_handler+0x244/0x553
[    3.740008]   [<ffffffff813533b8>] ata_scsi_error+0x9b/0xc6
[    3.740008]   [<ffffffff8132f8b4>] scsi_error_handler+0x114/0x5a0
[    3.740008]   [<ffffffff81072d55>] kthread+0xa8/0xb0
[    3.740008]   [<ffffffff814c4ca4>] kernel_thread_helper+0x4/0x10

Full dmesg attached.

Comment 13 Michal Jaegermann 2011-04-21 01:31:48 UTC

Created attachment 493676 [details]
dmesg for 2.6.39-0.rc3.git2.0.fc16.x86_64 ("1 lock held by scsi_eh_1/196")

This time there is:

#0:  (&host->eh_mutex){+.+...}, at: [<ffffffff8134ef93>] ata_eh_acquire+0x20/0x6e

Comment 14 Michal Jaegermann 2011-07-16 17:36:08 UTC

Created attachment 513491 [details]
locking error as showing up in 3.0-0.rc7.git0.1.fc16

This error is clearly inherited also by 3.0-... kernels.  This time an error message is somewhat more scary:

[    5.093008]  Possible unsafe locking scenario:
[    5.093008] 
[    5.093008]        CPU0
[    5.093008]        ----
[    5.093008]   lock(&(&host->lock)->rlock);
[    5.093008]   <Interrupt>
[    5.093008]     lock(&(&host->lock)->rlock);
[    5.093008] 
[    5.093008]  *** DEADLOCK ***
[    5.093008] 
[    5.093008] 1 lock held by udevd/145:

A remainder of a problem report is attached.

Comment 15 Peter Lemenkov 2011-09-10 18:44:23 UTC

Me too was just hit by this.

Comment 16 Fedora End Of Life 2013-04-03 18:42:17 UTC

This bug appears to have been reported against 'rawhide' during the Fedora 19 development cycle.
Changing version to '19'.

(As we did not run this process for some time, it could affect also pre-Fedora 19 development
cycle bugs. We are very sorry. It will help us with cleanup during Fedora 19 End Of Life. Thank you.)

More information and reason for this action is here:
https://fedoraproject.org/wiki/BugZappers/HouseKeeping/Fedora19

Comment 17 Justin M. Forbes 2013-04-05 16:49:20 UTC

Is this still a problem with 3.9 based F19 kernels?

Comment 18 Michal Jaegermann 2013-04-05 17:13:52 UTC

(In reply to comment #17)
> Is this still a problem with 3.9 based F19 kernels?

No.  I do not remember when I have seen that the last time but it was quite a while.

Comment 19 Josh Boyer 2013-09-18 20:53:26 UTC

*********** MASS BUG UPDATE **************

We apologize for the inconvenience.  There is a large number of bugs to go through and several of them have gone stale.  Due to this, we are doing a mass bug update across all of the Fedora 19 kernel bugs.

Fedora 19 has now been rebased to 3.11.1-200.fc19.  Please test this kernel update and let us know if you issue has been resolved or if it is still present with the newer kernel.

If you experience different issues, please open a new bug report for those.

Comment 20 Orion Poplawski 2013-09-18 22:43:37 UTC

Seems to be fixed a while ago.