Bug 668775

Summary: BKL (lock_kernel) in soft lockup during parallel IO discovery
Product: Red Hat Enterprise Linux 6
Reporter: Tim Wilkinson <twilkins>
Component: kernel
Assignee: Jeff Moyer <jmoyer>
Status: CLOSED ERRATA
QA Contact: Gris Ge <fge>
Severity: urgent
Docs Contact:
Priority: high
Version: 6.1
CC: bdonahue, buchino, chellwig, coughlan, czhang, dshaks, dzickus, eddie.williams, fge, james.leddy, jmoyer, kzhang, msnitzer, perfbz, prarit, pzijlstr, rwheeler, sgandhar, stbechto, woodard
Target Milestone:
Target Release: ---
Hardware: Unspecified
OS: Linux
Whiteboard:
Fixed In Version: kernel-2.6.32-151.el6
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Clones: 690523 1225598
Environment:
Last Closed: 2011-12-06 12:36:48 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 658636, 690523, 699556, 1225598

Attachments:

Description	Flags
sysrq-t output during high load	none

Description Tim Wilkinson 2011-01-11 15:34:08 UTC

Description of problem:
----------------------
The big kernel lock (BKL) is involved in a soft lockup at boot time on an HP DL980 G7 server with hyperthreading enabled (128 "CPUs"). The logged output indicates that device discovery is in progress when the lockup occurs.

The attached storage consists of 120 FC volumes on 15 P2000 RAID controllers, connected via 30 HBAs. The server is running RHEL6 RC1.

This has been observed routinely when attempting to boot at 128 CPUs and occurs, albeit much less frequently, even when the system is not hyperthreaded and is running at 64 CPUs.
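
A toy queueing model gives one possible intuition for why the reports could become more frequent as the CPU count grows. The sketch below is not kernel code. It assumes that all waiters arrive at once, are served strictly in arrival order, and that each holder keeps the lock for the same time. The function names, the waiter counts, and the 1-second hold time are hypothetical; only the 61-second figure comes from the "stuck for 61s" messages in the log.

```python
def fifo_wait_times(num_waiters, hold_seconds):
    """Wait time of each waiter on a FIFO lock.

    Assumes every waiter arrives at time zero and each holder keeps the
    lock for exactly hold_seconds (a hypothetical, uniform hold time).
    """
    return [position * hold_seconds for position in range(num_waiters)]


def waiters_over_threshold(num_waiters, hold_seconds, threshold_seconds=61):
    """Count waiters that would wait longer than threshold_seconds."""
    waits = fifo_wait_times(num_waiters, hold_seconds)
    return sum(1 for wait in waits if wait > threshold_seconds)


if __name__ == "__main__":
    # Hypothetical inputs: one waiter per CPU, 1.0 s hold time per holder.
    for cpus in (64, 128):
        print(cpus, waiters_over_threshold(cpus, hold_seconds=1.0))
```

Under these assumed inputs, doubling the number of waiters moves far more of them past the threshold, which is at least consistent with the difference seen between 64 and 128 CPUs. The real hold times during discovery are not known from this report.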



Component Version-Release:
-------------------------
2.6.32-71.el6.x86_64



How reproducible:
----------------
Consistent when booting with HT at 128 CPUs



Additional info:
---------------
sd 8:0:20:3: [sddq] 1145716064 512-byte logical blocks: (586 GB/546 GiB)
scsi 8:0:20:5: Direct-Access     HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
sd 8:0:20:3: [sddq] Write Protect is off
sd 8:0:20:3: [sddq] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
sd 8:0:20:5: [sddr] 1145716064 512-byte logical blocks: (586 GB/546 GiB)
sd 8:0:20:5: [sddr] Write Protect is off
sd 8:0:20:5: [sddr] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
scsi 14:0:21:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 8:0:21:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:22:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 8:0:22:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:23:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 8:0:23:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:24:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 8:0:24:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:25:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:26:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:27:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:28:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:29:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:30:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:31:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:32:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:33:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:34:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:35:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:36:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:37:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:38:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
scsi 14:0:39:0: Enclosure         HP       P2000 G3 FC      T200 PQ: 0 ANSI: 5
BUG: soft lockup - CPU#56 stuck for 61s! [async/96:2718]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 56:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2718, comm: async/96 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RSP: 0018:ffff8847da309d30  EFLAGS: 00000283
RAX: 0000000000000000 RBX: ffff8847da309d30 RCX: 000000000000184f
RDX: 00000000000018ab RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847da309cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88e01c400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#98 stuck for 61s! [async/13:2635]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 98:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2635, comm: async/13 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
RSP: 0018:ffff8847d9a1dd30  EFLAGS: 00000283
RAX: 0000000000000000 RBX: ffff8847d9a1dd30 RCX: 000000000000184f
RDX: 0000000000001854 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a1dcc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88801c540000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007f30f79e70f0 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#113 stuck for 61s! [async/8:2630]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 113:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2630, comm: async/8 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RSP: 0018:ffff8847d9a0dd30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847d9a0dd30 RCX: 000000000000184f
RDX: 0000000000001850 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a0dcc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88c01c520000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fffba5d7c8c CR3: 000000c7db9ae000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#0 stuck for 61s! [async/85:2707]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 0:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2707, comm: async/85 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
RSP: 0018:ffff8847d9ba7d30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847d9ba7d30 RCX: 000000000000184f
RDX: 000000000000189d RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9ba7cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff880028200000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc72ab07b60 CR3: 000000c7da391000 CR4: 00000000000006f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#40 stuck for 61s! [async/99:2721]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 40:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2721, comm: async/99 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814caca8>]  [<ffffffff814caca8>] lock_kernel+0x28/0x50
RSP: 0018:ffff8847da301d30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847da301d30 RCX: 000000000000184f
RDX: 00000000000018a9 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847da301cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88a01c600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000623478 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#52 stuck for 61s! [async/9:2631]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 52:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2631, comm: async/9 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RSP: 0018:ffff8847d9a11d30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847d9a11d30 RCX: 000000000000184f
RDX: 0000000000001853 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a11cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88c01c480000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc72ad9d4f8 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#73 stuck for 61s! [async/11:2633]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 73:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2633, comm: async/11 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
RSP: 0018:ffff8847d9a17d30  EFLAGS: 00000283
RAX: 0000000000000000 RBX: ffff8847d9a17d30 RCX: 000000000000184f
RDX: 0000000000001852 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a17cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88201c720000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000001cef4a8 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#81 stuck for 61s! [async/10:2632]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 81:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2632, comm: async/10 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RSP: 0018:ffff8847d9a13d30  EFLAGS: 00000283
RAX: 0000000000000000 RBX: ffff8847d9a13d30 RCX: 000000000000184f
RDX: 0000000000001851 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a13cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88401c520000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000623478 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#83 stuck for 61s! [async/29:2651]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 83:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2651, comm: async/29 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RSP: 0018:ffff8847d9a53d30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847d9a53d30 RCX: 000000000000184f
RDX: 0000000000001864 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a53cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88401c560000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc72aafc2f0 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#84 stuck for 61s! [async/78:2700]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 84:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2700, comm: async/78 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
RSP: 0018:ffff8847d9b91d30  EFLAGS: 00000283
RAX: 0000000000000000 RBX: ffff8847d9b91d30 RCX: 000000000000184f
RDX: 0000000000001895 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9b91cc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88401c580000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00007fc72aafc2f0 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#85 stuck for 61s! [async/37:2659]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 85:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2659, comm: async/37 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RSP: 0018:ffff8847d9a6dd30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847d9a6dd30 RCX: 000000000000184f
RDX: 000000000000186d RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9a6dcc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88401c5a0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#86 stuck for 61s! [async/57:2679]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 86:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2679, comm: async/57 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
RSP: 0018:ffff8847d9aefd30  EFLAGS: 00000283
RAX: 0000000000000000 RBX: ffff8847d9aefd30 RCX: 000000000000184f
RDX: 0000000000001881 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9aefcc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88401c5c0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#87 stuck for 61s! [async/56:2678]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
CPU 87:
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
Pid: 2678, comm: async/56 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
RSP: 0018:ffff8847d9aebd30  EFLAGS: 00000287
RAX: 0000000000000000 RBX: ffff8847d9aebd30 RCX: 000000000000184f
RDX: 0000000000001880 RSI: 0000000000000004 RDI: ffff88c7dccc6400
RBP: ffffffff81013c8e R08: 0000000000000004 R09: ffff88c7dccc6410
R10: ffff88a7dccb6560 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff81200840 R14: ffff8847d9aebcc0 R15: ffff88e7dcccb400
FS:  0000000000000000(0000) GS:ffff88401c5e0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Call Trace:
 [<ffffffff811a4320>] ? __blkdev_get+0x50/0x3c0
 [<ffffffff8125ac77>] ? kobject_put+0x27/0x60
 [<ffffffff811a46a0>] ? blkdev_get+0x10/0x20
 [<ffffffff811dae35>] ? register_disk+0x155/0x170
 [<ffffffff8124901c>] ? add_disk+0x8c/0x160
 [<ffffffffa017b34b>] ? sd_probe_async+0x13b/0x210 [sd_mod]
 [<ffffffff81092006>] ? add_wait_queue+0x46/0x60
 [<ffffffff81099042>] ? async_thread+0x102/0x250
 [<ffffffff8105c490>] ? default_wake_function+0x0/0x20
 [<ffffffff81098f40>] ? async_thread+0x0/0x250
 [<ffffffff81091936>] ? kthread+0x96/0xa0
 [<ffffffff810141ca>] ? child_rip+0xa/0x20
 [<ffffffff810918a0>] ? kthread+0x0/0xa0
 [<ffffffff810141c0>] ? child_rip+0x0/0x20
BUG: soft lockup - CPU#88 stuck for 61s! [async/88:2710]
Modules linked in: sd_mod crc_t10dif ata_generic pata_acpi ata_piix hpsa qla2xxx scsi_transport_fc scsi_tgt radeon ttm drm_kms_helper drm i2c_algo_bit i2c_core dm_mod
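
The soft lockup reports above can be tallied with a short script rather than read one by one. The following is a minimal sketch, assuming the console log is available as text in the format shown above. The names `summarize_lockups` and `SAMPLE` are illustrative and not part of any existing tool; the sample lines are copied from the log above.

```python
import re
from collections import Counter

# Watchdog header, e.g.:
#   BUG: soft lockup - CPU#56 stuck for 61s! [async/96:2718]
LOCKUP_RE = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^:\]]+):(?P<pid>\d+)\]"
)
# Instruction pointer line, e.g.:
#   RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
RIP_RE = re.compile(
    r"RIP: .*\]\s+(?P<sym>[A-Za-z_][\w.]*)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def summarize_lockups(log_text):
    """Return (reports, counts of the RIP symbol) for each lockup report."""
    reports = []
    current = None
    for line in log_text.splitlines():
        m = LOCKUP_RE.search(line)
        if m:
            current = {"cpu": int(m["cpu"]), "secs": int(m["secs"]),
                       "comm": m["comm"], "pid": int(m["pid"]), "rip": None}
            reports.append(current)
            continue
        m = RIP_RE.search(line)
        # Attach the first RIP line that follows a report header.
        if m and current is not None and current["rip"] is None:
            current["rip"] = m["sym"]
    by_symbol = Counter(r["rip"] for r in reports if r["rip"])
    return reports, by_symbol


SAMPLE = """\
BUG: soft lockup - CPU#56 stuck for 61s! [async/96:2718]
Pid: 2718, comm: async/96 Not tainted 2.6.32-71.el6.x86_64 #1 ProLiant DL980 G7
RIP: 0010:[<ffffffff814cacae>]  [<ffffffff814cacae>] lock_kernel+0x2e/0x50
BUG: soft lockup - CPU#98 stuck for 61s! [async/13:2635]
RIP: 0010:[<ffffffff814cacb5>]  [<ffffffff814cacb5>] lock_kernel+0x35/0x50
"""

if __name__ == "__main__":
    reports, by_symbol = summarize_lockups(SAMPLE)
    for report in reports:
        print(report["cpu"], report["comm"], report["rip"])
    print(dict(by_symbol))
```

Run against the full console log, the per-symbol counts show how many of the stuck CPUs were inside `lock_kernel` when the watchdog fired. A report whose RIP line is missing, such as the truncated final entry, is listed with `rip` set to `None` and left out of the counts.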

Comment 2 RHEL Program Management 2011-02-01 05:50:40 UTC

This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 3 RHEL Program Management 2011-02-01 19:05:30 UTC

This request was erroneously denied for the current release of
Red Hat Enterprise Linux.  The error has been fixed and this
request has been re-proposed for the current release.

Comment 4 Sumeet Gandhare 2011-03-14 14:20:52 UTC

Seeing following traces on a system trying to modprobe LPFC.

crash>  bt 1774
PID: 1774   TASK: ffff8804286ef4a0  CPU: 5   COMMAND: "async/15"
 #0 [ffff880028347e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028347e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028347ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028347ee0] notify_die at ffffffff81096fce
 #4 [ffff880028347f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028347f50] nmi at ffffffff814cbf00
    [exception RIP: smp_call_function_many+434]
    RIP: ffffffff810a7372  RSP: ffff88042b391c90  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff8800283520a0  RCX: 0000000000000008
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff88042b391cd0   R8: 0000000000000000   R9: ffff88041fedde00
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000005
    R13: ffffffff818a3ba0  R14: ffffffff818a3ba0  R15: ffffffff8119dcb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042b391c90] smp_call_function_many at ffffffff810a7372
 #7 [ffff88042b391cd8] smp_call_function at ffffffff810a73f2
 #8 [ffff88042b391ce8] on_each_cpu at ffffffff81073914
 #9 [ffff88042b391d18] invalidate_bh_lrus at ffffffff8119d92c
#10 [ffff88042b391d28] kill_bdev at ffffffff811a3f58
#11 [ffff88042b391d48] __blkdev_put at ffffffff811a4c60
#12 [ffff88042b391d98] blkdev_put at ffffffff811a4d50
#13 [ffff88042b391da8] register_disk at ffffffff811db82a
#14 [ffff88042b391df8] add_disk at ffffffff81249acc
#15 [ffff88042b391e28] sd_probe_async at ffffffffa017a34b
#16 [ffff88042b391e68] async_thread at ffffffff81099192
#17 [ffff88042b391ee8] kthread at ffffffff81091a86
#18 [ffff88042b391f48] kernel_thread at ffffffff810141ca
crash>



crash> bt -a
PID: 2317   TASK: ffff88041d0ef4a0  CPU: 0   COMMAND: "blkid"
 #0 [ffff880028207e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028207e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028207ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028207ee0] notify_die at ffffffff81096fce
 #4 [ffff880028207f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028207f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88041d0f1e28  RFLAGS: 00000297
    RAX: 0000000000000000  RBX: ffff880420e67500  RCX: 0000000000001e6b
    RDX: 0000000000001e6f  RSI: 000000000000101d  RDI: ffff880420e67520
    RBP: ffff88041d0f1e28   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000246  R12: 0000000000000000
    R13: 000000000000101d  R14: ffff880420e67520  R15: ffff880429dd0c00
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88041d0f1e28] lock_kernel at ffffffff814cb9de
 #7 [ffff88041d0f1e30] __blkdev_put at ffffffff811a4bf2
 #8 [ffff88041d0f1e80] blkdev_put at ffffffff811a4d50
 #9 [ffff88041d0f1e90] blkdev_close at ffffffff811a4d93
#10 [ffff88041d0f1ec0] __fput at ffffffff8116eb05
#11 [ffff88041d0f1f10] fput at ffffffff8116ec45
#12 [ffff88041d0f1f20] filp_close at ffffffff8116a19d
#13 [ffff88041d0f1f50] sys_close at ffffffff8116a275
#14 [ffff88041d0f1f80] system_call_fastpath at ffffffff81013172
    RIP: 0000003dfeed4150  RSP: 00007fff3cc56dd8  RFLAGS: 00010202
    RAX: 0000000000000003  RBX: ffffffff81013172  RCX: 0000003e01020ba0
    RDX: 0000000000000000  RSI: 0000000000000000  RDI: 0000000000000003
    RBP: 0000000000000003   R8: 00000000013487e8   R9: 0000000000300000
    R10: 0000000000000000  R11: 0000000000000246  R12: 00007fff3cc57c4d
    R13: 00007fff3cc572c0  R14: 0000000000000000  R15: 00007fff3cc572c0
    ORIG_RAX: 0000000000000003  CS: 0033  SS: 002b

PID: 28     TASK: ffff88042e2f8b30  CPU: 1   COMMAND: "events/1"
 #0 [ffff880028247e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028247e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028247ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028247ee0] notify_die at ffffffff81096fce
 #4 [ffff880028247f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028247f50] nmi at ffffffff814cbf00
    [exception RIP: cfb_imageblit+1213]
    RIP: ffffffff812adf8d  RSP: ffff88042e31b950  RFLAGS: 00000046
    RAX: ffffc90012d77134  RBX: ffff880429104978  RCX: 0000000000000000
    RDX: 0000000000000000  RSI: 000000000000000f  RDI: 0000000000000000
    RBP: ffff88042e31b9d0   R8: 0000000007070707   R9: 0000000000000004
    R10: ffffffff81513520  R11: ffff880429104974  R12: 0000000000000003
    R13: ffffc90012d77168  R14: 0000000000000400  R15: 000000000000000b
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042e31b950] cfb_imageblit at ffffffff812adf8d
 #7 [ffff88042e31b9d8] bit_putcs at ffffffff812a6e9e
 #8 [ffff88042e31bb28] fbcon_putcs at ffffffff812a3646
 #9 [ffff88042e31bba8] fbcon_redraw at ffffffff812a3a2c
#10 [ffff88042e31bc18] fbcon_scroll at ffffffff812a5543
#11 [ffff88042e31bc98] scrup at ffffffff81307280
#12 [ffff88042e31bcd8] lf at ffffffff8130741d
#13 [ffff88042e31bcf8] vt_console_print at ffffffff813091e2
#14 [ffff88042e31bd58] __call_console_drivers at ffffffff8106baa5
#15 [ffff88042e31bd88] _call_console_drivers at ffffffff8106bb0a
#16 [ffff88042e31bda8] release_console_sem at ffffffff8106bfe8
#17 [ffff88042e31bde8] fb_flashcursor at ffffffff812a08ea
#18 [ffff88042e31be38] worker_thread at ffffffff8108c720
#19 [ffff88042e31bee8] kthread at ffffffff81091a86
#20 [ffff88042e31bf48] kernel_thread at ffffffff810141ca

PID: 1908   TASK: ffff88042a39eb30  CPU: 2   COMMAND: "async/19"
 #0 [ffff8800282836b0] sysrq_handle_crash at ffffffff8130dfe6
 #1 [ffff8800282836d0] pointer at ffffffff812628c3
 #2 [ffff880028283868] sysrq_handle_crash at ffffffff8130dfe6
 #3 [ffff8800282838f0] machine_kexec at ffffffff8103697b
 #4 [ffff880028283950] crash_kexec at ffffffff810b9078
 #5 [ffff880028283a20] oops_end at ffffffff814cc900
 #6 [ffff880028283a50] no_context at ffffffff8104652b
 #7 [ffff880028283aa0] __bad_area_nosemaphore at ffffffff810467b5
 #8 [ffff880028283af0] bad_area_nosemaphore at ffffffff81046883
 #9 [ffff880028283b00] do_page_fault at ffffffff814ce388
#10 [ffff880028283b50] page_fault at ffffffff814cbc75
    [exception RIP: sysrq_handle_crash+22]
    RIP: ffffffff8130dfe6  RSP: ffff880028283c08  RFLAGS: 00010092
    RAX: 0000000000000010  RBX: 0000000000000063  RCX: 00000000000008bc
    RDX: 0000000000000000  RSI: ffff880429a15000  RDI: 0000000000000063
    RBP: ffff880028283c08   R8: 0000000000000073   R9: 0000000000000000
    R10: 00000000000000fa  R11: 0000000000000000  R12: ffff880429a15000
    R13: ffffffff817a0700  R14: 0000000000000086  R15: 0000000000000007
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
#11 [ffff880028283c10] __handle_sysrq at ffffffff8130e2a2
#12 [ffff880028283c60] handle_sysrq at ffffffff8130e38b
#13 [ffff880028283c70] kbd_event at ffffffff81305251
#14 [ffff880028283ce0] input_pass_event at ffffffff813b11c6
#15 [ffff880028283d20] input_handle_event at ffffffff813b2dc3
#16 [ffff880028283d60] input_event at ffffffff813b36c4
#17 [ffff880028283db0] atkbd_interrupt at ffffffff813ba370
#18 [ffff880028283e20] serio_interrupt at ffffffff813ae3f2
#19 [ffff880028283e60] i8042_interrupt at ffffffff813af1b2
#20 [ffff880028283ed0] handle_IRQ_event at ffffffff810d8960
#21 [ffff880028283f20] handle_edge_irq at ffffffff810db046
#22 [ffff880028283f60] handle_irq at ffffffff81015fb9
#23 [ffff880028283f80] do_IRQ at ffffffff814d063c
--- <IRQ stack> ---
#24 [ffff88042aca7c88] ret_from_intr at ffffffff81013ad3
    [exception RIP: lock_kernel+53]
    RIP: ffffffff814cb9e5  RSP: ffff88042aca7d30  RFLAGS: 00000293
    RAX: 0000000000000000  RBX: ffff88042aca7d30  RCX: 0000000000001e6b
    RDX: 0000000000001e6e  RSI: 0000000000000004  RDI: ffff88042e83fc00
    RBP: ffffffff81013ace   R8: 0000000000000004   R9: ffff88042e83fc10
    R10: ffff88042e83fc60  R11: 0000000000000000  R12: 0000000000000001
    R13: ffffffff81201230  R14: ffff88042aca7cc0  R15: ffff88042e098c00
    ORIG_RAX: ffffffffffffffce  CS: 0010  SS: 0018
#25 [ffff88042aca7d38] __blkdev_get at ffffffff811a4e50
#26 [ffff88042aca7d98] blkdev_get at ffffffff811a51d0
#27 [ffff88042aca7da8] register_disk at ffffffff811db815
#28 [ffff88042aca7df8] add_disk at ffffffff81249acc
#29 [ffff88042aca7e28] sd_probe_async at ffffffffa017a34b
#30 [ffff88042aca7e68] async_thread at ffffffff81099192
#31 [ffff88042aca7ee8] kthread at ffffffff81091a86
#32 [ffff88042aca7f48] kernel_thread at ffffffff810141ca

PID: 1964   TASK: ffff88042c268af0  CPU: 3   COMMAND: "async/31"
 #0 [ffff8800282c7e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff8800282c7e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff8800282c7ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff8800282c7ee0] notify_die at ffffffff81096fce
 #4 [ffff8800282c7f10] do_nmi at ffffffff814cc60c
 #5 [ffff8800282c7f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88042c295d30  RFLAGS: 00000293
    RAX: 0000000000000000  RBX: ffff880427036a80  RCX: 0000000000001e6b
    RDX: 0000000000001e6d  RSI: 0000000000000004  RDI: ffff88042e83fc00
    RBP: ffff88042c295d30   R8: 0000000000000004   R9: ffff88042e83fc10
    R10: ffff88042e83fc60  R11: 0000000000000000  R12: ffff880427036a80
    R13: 0000000000000000  R14: ffff88042b94c138  R15: 0000000000000001
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042c295d30] lock_kernel at ffffffff814cb9de
 #7 [ffff88042c295d38] __blkdev_get at ffffffff811a4e50
 #8 [ffff88042c295d98] blkdev_get at ffffffff811a51d0
 #9 [ffff88042c295da8] register_disk at ffffffff811db815
#10 [ffff88042c295df8] add_disk at ffffffff81249acc
#11 [ffff88042c295e28] sd_probe_async at ffffffffa017a34b
#12 [ffff88042c295e68] async_thread at ffffffff81099192
#13 [ffff88042c295ee8] kthread at ffffffff81091a86
#14 [ffff88042c295f48] kernel_thread at ffffffff810141ca

PID: 1771   TASK: ffff880428baea70  CPU: 4   COMMAND: "async/13"
 #0 [ffff880028307e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028307e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028307ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028307ee0] notify_die at ffffffff81096fce
 #4 [ffff880028307f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028307f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88042b289d40  RFLAGS: 00000297
    RAX: 0000000000000000  RBX: ffff880420c830c0  RCX: 0000000000001e6b
    RDX: 0000000000001e6c  RSI: 0000000000000001  RDI: ffff880420c830e0
    RBP: ffff88042b289d40   R8: 0000000000000000   R9: 0000000000000000
    R10: 0000000000000002  R11: 0000000000000000  R12: 0000000000000000
    R13: 0000000000000001  R14: ffff880420c830e0  R15: ffff88042b52e000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042b289d40] lock_kernel at ffffffff814cb9de
 #7 [ffff88042b289d48] __blkdev_put at ffffffff811a4bf2
 #8 [ffff88042b289d98] blkdev_put at ffffffff811a4d50
 #9 [ffff88042b289da8] register_disk at ffffffff811db82a
#10 [ffff88042b289df8] add_disk at ffffffff81249acc
#11 [ffff88042b289e28] sd_probe_async at ffffffffa017a34b
#12 [ffff88042b289e68] async_thread at ffffffff81099192
#13 [ffff88042b289ee8] kthread at ffffffff81091a86
#14 [ffff88042b289f48] kernel_thread at ffffffff810141ca

PID: 1774   TASK: ffff8804286ef4a0  CPU: 5   COMMAND: "async/15"
 #0 [ffff880028347e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028347e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028347ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028347ee0] notify_die at ffffffff81096fce
 #4 [ffff880028347f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028347f50] nmi at ffffffff814cbf00
    [exception RIP: smp_call_function_many+434]
    RIP: ffffffff810a7372  RSP: ffff88042b391c90  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff8800283520a0  RCX: 0000000000000008
    RDX: 0000000000000001  RSI: 0000000000000000  RDI: 0000000000000286
    RBP: ffff88042b391cd0   R8: 0000000000000000   R9: ffff88041fedde00
    R10: 0000000000000000  R11: 0000000000000001  R12: 0000000000000005
    R13: ffffffff818a3ba0  R14: ffffffff818a3ba0  R15: ffffffff8119dcb0
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88042b391c90] smp_call_function_many at ffffffff810a7372
 #7 [ffff88042b391cd8] smp_call_function at ffffffff810a73f2
 #8 [ffff88042b391ce8] on_each_cpu at ffffffff81073914
 #9 [ffff88042b391d18] invalidate_bh_lrus at ffffffff8119d92c
#10 [ffff88042b391d28] kill_bdev at ffffffff811a3f58
#11 [ffff88042b391d48] __blkdev_put at ffffffff811a4c60
#12 [ffff88042b391d98] blkdev_put at ffffffff811a4d50
#13 [ffff88042b391da8] register_disk at ffffffff811db82a
#14 [ffff88042b391df8] add_disk at ffffffff81249acc
#15 [ffff88042b391e28] sd_probe_async at ffffffffa017a34b
#16 [ffff88042b391e68] async_thread at ffffffff81099192
#17 [ffff88042b391ee8] kthread at ffffffff81091a86
#18 [ffff88042b391f48] kernel_thread at ffffffff810141ca

PID: 2321   TASK: ffff88041d074080  CPU: 6   COMMAND: "udevd"
 #0 [ffff880028387e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff880028387e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff880028387ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff880028387ee0] notify_die at ffffffff81096fce
 #4 [ffff880028387f10] do_nmi at ffffffff814cc60c
 #5 [ffff880028387f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88041d15bcc8  RFLAGS: 00000287
    RAX: 0000000000000000  RBX: ffff88042dad5758  RCX: 0000000000001e6b
    RDX: 0000000000001e71  RSI: ffff88041fceae00  RDI: ffff88042dad5758
    RBP: ffff88041d15bcc8   R8: ffff88041d1ceb40   R9: ffff88041e30fd80
    R10: ffff88042df24900  R11: 0000000000000002  R12: ffff88042dad5758
    R13: ffff88041fceae00  R14: ffff88041fceae00  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88041d15bcc8] lock_kernel at ffffffff814cb9de
 #7 [ffff88041d15bcd0] memory_open at ffffffff812f15d4
 #8 [ffff88041d15bd00] chrdev_open at ffffffff81171315
 #9 [ffff88041d15bd60] __dentry_open at ffffffff8116a590
#10 [ffff88041d15bdc0] nameidata_to_filp at ffffffff8116a907
#11 [ffff88041d15bde0] do_filp_open at ffffffff8117dafa
#12 [ffff88041d15bf20] do_sys_open at ffffffff8116a339
#13 [ffff88041d15bf70] sys_open at ffffffff8116a450
#14 [ffff88041d15bf80] system_call_fastpath at ffffffff81013172
    RIP: 0000003dfeed3fc0  RSP: 00007fffb93153a8  RFLAGS: 00010246
    RAX: 0000000000000002  RBX: ffffffff81013172  RCX: 0000003dfeed4150
    RDX: 0000000000000000  RSI: 0000000000000002  RDI: 000000000041d4a6
    RBP: 0000000000000000   R8: 000000000000070d   R9: 0000000000000000
    R10: 00007f926de14a70  R11: 0000000000000246  R12: ffffffff8116a450
    R13: ffff88041d15bf78  R14: 000000000164ddb0  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b

PID: 2337   TASK: ffff88041d20ab30  CPU: 7   COMMAND: "multipath"
 #0 [ffff8800283c7e80] crash_nmi_callback at ffffffff8102e036
 #1 [ffff8800283c7e90] notifier_call_chain at ffffffff814ce4d5
 #2 [ffff8800283c7ed0] atomic_notifier_call_chain at ffffffff814ce53a
 #3 [ffff8800283c7ee0] notify_die at ffffffff81096fce
 #4 [ffff8800283c7f10] do_nmi at ffffffff814cc60c
 #5 [ffff8800283c7f50] nmi at ffffffff814cbf00
    [exception RIP: lock_kernel+46]
    RIP: ffffffff814cb9de  RSP: ffff88041d141c98  RFLAGS: 00000283
    RAX: 0000000000000000  RBX: 0000000000a0003a  RCX: 0000000000001e6b
    RDX: 0000000000001e70  RSI: ffff88041d1ceb40  RDI: ffff880429b17758
    RBP: ffff88041d141c98   R8: ffff88042ba79440   R9: ffff88042b76a900
    R10: ffff88042df3c180  R11: 0000000000000002  R12: ffff88041d1ceb40
    R13: ffff880429b17758  R14: ffff88041d1ceb40  R15: 0000000000000000
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
 #6 [ffff88041d141c98] lock_kernel at ffffffff814cb9de
 #7 [ffff88041d141ca0] misc_open at ffffffff812ff424
 #8 [ffff88041d141d00] chrdev_open at ffffffff81171315
 #9 [ffff88041d141d60] __dentry_open at ffffffff8116a590
#10 [ffff88041d141dc0] nameidata_to_filp at ffffffff8116a907
#11 [ffff88041d141de0] do_filp_open at ffffffff8117dafa
#12 [ffff88041d141f20] do_sys_open at ffffffff8116a339
#13 [ffff88041d141f70] sys_open at ffffffff8116a450
#14 [ffff88041d141f80] system_call_fastpath at ffffffff81013172
    RIP: 0000003dff20ed30  RSP: 00007fffd15727c8  RFLAGS: 00010202
    RAX: 0000000000000002  RBX: ffffffff81013172  RCX: 0000003dfeed3ae5
    RDX: 0000000000000a3a  RSI: 0000000000000002  RDI: 00007fffd1572820
    RBP: 00007fffd1572820   R8: 00007f9ff8fc67a0   R9: 0000000000000004
    R10: fffffffffffff6c4  R11: 0000000000000246  R12: ffffffff8116a450
    R13: ffff88041d141f78  R14: 0000000000000000  R15: 0000000000000000
    ORIG_RAX: 0000000000000002  CS: 0033  SS: 002b
crash>
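
For readers skimming the `bt -a` output above, the following is a minimal Python sketch (not part of the crash utility, and not something the commenters ran) that groups tasks by the kernel symbol in their last listed `[exception RIP: ...]` line. The function name `summarize_bt` and the regular expressions are assumptions made for illustration; they are written only against the line shapes visible in this comment.

```python
import re
from collections import defaultdict

# Task header as printed by crash, for example:
#   PID: 2317   TASK: ffff88041d0ef4a0  CPU: 0   COMMAND: "blkid"
TASK_RE = re.compile(
    r'^PID:\s+(\d+)\s+TASK:\s+\S+\s+CPU:\s+(\d+)\s+COMMAND:\s+"([^"]+)"'
)
# Kernel-mode exception frame, for example:  [exception RIP: lock_kernel+46]
RIP_RE = re.compile(r'\[exception RIP: (\w+)\+\d+\]')


def summarize_bt(text):
    """Map each interrupted kernel symbol to a list of (cpu, command) pairs."""
    by_symbol = defaultdict(list)
    task, symbol = None, None
    for line in text.splitlines():
        m = TASK_RE.match(line.strip())
        if m:
            # A new task starts: record the previous one, if it had a frame.
            if task and symbol:
                by_symbol[symbol].append(task)
            task, symbol = (int(m.group(2)), m.group(3)), None
            continue
        m = RIP_RE.search(line)
        if m:
            # Keep the last listed frame, i.e. what the task was doing
            # before any nested interrupt or fault.
            symbol = m.group(1)
    if task and symbol:
        by_symbol[symbol].append(task)
    return dict(by_symbol)
```

Feeding it the full `bt -a` text should make it easy to see which CPUs are waiting in `lock_kernel` and which one is in `smp_call_function_many`.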

Comment 7 Jeff Moyer 2011-03-14 15:52:45 UTC

(In reply to comment #4)
> Seeing following traces on a system trying to modprobe LPFC.

Hi, Sumeet,

OK, so you tried to reproduce this on an 8 processor box, it seems.  I looked at the vmcore, but the log was filled with what appears to be sysrq output (not entirely sure about that).  Did you actually experience a softlockup?  It's not apparent from what you posted.  Also, there are a lot of 'Device not ready' errors in the logs:

sd 3:0:0:14: [sdq] CDB: Read(10): 28 00 00 00 00 00 00 00 08 00
end_request: I/O error, dev sdq, sector 0
sd 3:0:0:14: [sdq] Device not ready
sd 3:0:0:14: [sdq] Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 3:0:0:14: [sdq] Sense Key : Not Ready [current] 
sd 3:0:0:14: [sdq] Add. Sense: Logical unit not ready, manual intervention required

They are mostly from sdq, though the following disks also show up in the logs (albeit with much lower frequency):

sdh
sdau
sdbn
sdbr
sdcg
sdcl
sdcs
sddo
sddp
sdcn
sdbm

Are you sure this is a symptom of the same problem experienced by the reporter?  How many HBAs were you testing with, and to how many logical volumes?
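
A per-disk tally like the one described above can be reproduced from a saved kernel log. The following is a minimal Python sketch, assuming the log lines have the same shape as the `[sdq] Device not ready` sample quoted in this comment; the function name `count_not_ready` is hypothetical.

```python
import re
from collections import Counter

# Matches the disk name in lines such as:
#   sd 3:0:0:14: [sdq] Device not ready
# search() is used so that an optional timestamp prefix does not matter.
NOT_READY_RE = re.compile(r'\[(sd[a-z]+)\] Device not ready')


def count_not_ready(log_lines):
    """Count 'Device not ready' messages per sd device name."""
    counts = Counter()
    for line in log_lines:
        m = NOT_READY_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```

Calling `count_not_ready(open("dmesg.txt"))` (the file name is an assumption) and then `.most_common()` on the result lists the noisiest devices first.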

Comment 8 Jeff Moyer 2011-03-14 16:08:36 UTC

Could the reporter please attach full dmesg/console output, if available?  I'm not sure why the problem is more readily reproducible with HT enabled.  As far as I can tell, the mutex implementation should be immune to starvation.  We could certainly backport the BKL removal patches in this area, but I'd like a better understanding of what's going on, first.

Comment 10 Jeff Moyer 2011-03-14 21:39:03 UTC

So, I didn't realize Sumeet was not trying to reproduce this issue, and I managed to overlook some key aspects, here.  It's clear that the smp_call_function didn't finish, and that that is the code path holding the BKL.

I'm not sure why the smp_call_function is hung.  The running cpu is cpu 2.  Examining its call_function_data shows that all cpus have cleared their bits, and the lock is not held.

crash> p per_cpu__cfd_data
PER-CPU DATA TYPE:
  struct call_function_data per_cpu__cfd_data;
PER-CPU ADDRESSES:
  [0]: ffff8800282120a0
  [1]: ffff8800282520a0
  [2]: ffff8800282920a0   <----  cpu2
  [3]: ffff8800282d20a0
  [4]: ffff8800283120a0
  [5]: ffff8800283520a0
  [6]: ffff8800283920a0
  [7]: ffff8800283d20a0
crash> call_function_data ffff8800282920a0
struct call_function_data {
  csd = {
    list = {
      next = 0xffffffff817311c0, 
      prev = 0xdead000000200200
    }, 
    func = 0xffffffff8119dcb0 <invalidate_bh_lru>, 
    info = 0x0, 
    flags = 0,    <--- CSD_LOCK not set
    priv = 0
  }, 
  refs = {
    counter = 0
  }, 
  cpumask = 0xffff88042e1fda00
}
crash> ptype cpu_online_mask
type = const struct cpumask {
    long unsigned int bits[64];
} * const
crash> p cpu_online_mask
cpu_online_mask = $4 = (const struct cpumask * const) 0xffffffff818a3ba0
crash> rd -x ffffffff818a3ba0
ffffffff818a3ba0:  00000000000000ff 
crash> rd -x 0xffff88042e1fda00   <--- cpumask embedded in the percpu data
ffff88042e1fda00:  0000000000000000 

Don, any ideas how this might happen?  Just to be certain I hadn't mismapped the cpu into the per-cpu array, I also checked all of the other per-cpu objects, and they had the exact same state.
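
To make the two `rd -x` reads above easier to interpret, here is a small Python sketch that expands a cpumask word into CPU numbers. It is only an illustration: the helper name `cpus_in_mask` is an assumption, and only the first 64-bit word of each mask is decoded because that is all that was read above.

```python
def cpus_in_mask(words, bits_per_word=64):
    """Return the CPU numbers set in a cpumask given as words (word 0 first)."""
    cpus = []
    for i, word in enumerate(words):
        for bit in range(bits_per_word):
            if (word >> bit) & 1:
                cpus.append(i * bits_per_word + bit)
    return cpus


# First word of cpu_online_mask, as read with `rd -x ffffffff818a3ba0`.
online = cpus_in_mask([0x00000000000000FF])
# First word of the cfd_data cpumask, as read with `rd -x 0xffff88042e1fda00`.
pending = cpus_in_mask([0x0000000000000000])
```

Here `online` covers CPUs 0 through 7 and `pending` is empty, which matches the observation that all CPUs have cleared their bits.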

Comment 11 Jeff Moyer 2011-03-14 21:41:19 UTC

Sumeet, how long was the system stuck like this?  Is it responsive otherwise?

Comment 12 Sumeet Gandhare 2011-03-15 09:41:03 UTC

Jeff,


The system gets hung immediately after invoking modprobe lpfc. After this command, the shell never returns the prompt.

The system never returns to a normal state; even after waiting for several hours, it remains unresponsive.

Comment 13 Jeff Moyer 2011-03-15 14:47:24 UTC

Sumeet, is this something that's easily reproducible?  Would the customer be willing to try a kernel with a couple of patches to the smp_call_function path to see if it resolves the problem?

Comment 14 Sumeet Gandhare 2011-03-15 15:59:46 UTC

Jeff,
The problem is easily reproducible on the customer's setup.  I am sure the customer will be willing to try a test kernel.  Could you please provide one?

Thanks,
Sumeet

Comment 15 Jeff Moyer 2011-03-15 18:49:57 UTC

Sumeet,
Please bring those device errors to the customer's attention.  Also, I'm still waiting for an answer to how many HBAs and how many LUNs are in this configuration.

Comment 37 Jeff Moyer 2011-03-28 14:06:31 UTC

Hi, Peter,

We have a customer system with a lot of storage attached.  During module load, the system load average climbs to over 130 and becomes very sluggish (see comment #34).  Most of the activity is related to setting up the sd devices using udev (it will call blkid for every partition, for example).  I'm not sure what to expect from the system when this happens.  It sounds like a case of, "don't do that."  However, I'd like an official story from a scheduler person on what the expected behavior is.

I'll attach the sysrq-t output.

Comment 38 Jeff Moyer 2011-03-28 14:07:32 UTC

Created attachment 488155 [details]
sysrq-t output during high load

Comment 39 Peter Zijlstra 2011-03-28 14:34:22 UTC

A number of recent fixes to kernel/smp.c address races in that code, and I think one of those races is very much like the problem reported here.  I would suggest backporting all patches touching this file.

In particular see commit 723aae2, but all other fixes there are very much needed too.

Comment 40 Jeff Moyer 2011-03-28 15:01:13 UTC

We already did backport a number of fixes (5, I think) for smp_call_function and friends.  That did fix the hung system and got rid of most of the soft lockups (there's still one left that I saw).  The question I had for you was, in the presence of 120 runnable processes running blkid and similar commands, would you expect the system to be responsive to interactive processes?  I'm sure it's not an easy question to answer.

Comment 41 Peter Zijlstra 2011-03-28 15:13:04 UTC

OK, missed the smp_call_function fixup there.

Right, hard question; that very much depends on what those processes do and on the state of the BKL mess in that kernel.

So, from what I could see of the backtraces in this BZ, the blkid stuff is BKL heavy and will thus serialize all 120 processes.  If, say, the TTY subsystem is also still a BKL user (it was for a long while; I'm not quite sure of the RHEL6 status, but a git grep shows lots of lock_kernel in drivers/char/), then anything tty related will also grind to a halt.
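
The serialization effect can be put in rough numbers. The following Python sketch is a back-of-the-envelope model, not a measurement: the 50 ms lock hold time is a made-up figure, the function names are hypothetical, and the model ignores the CPU time that waiters burn while spinning on the lock.

```python
def serialized_makespan(n_tasks, hold_time_ms):
    """Total time when every task must hold one global lock, one at a time."""
    return n_tasks * hold_time_ms


def parallel_makespan(n_tasks, hold_time_ms, n_cpus):
    """Total time for the same work with no shared lock, run in waves of n_cpus."""
    waves = -(-n_tasks // n_cpus)  # ceiling division
    return waves * hold_time_ms


# 120 runnable processes (from this thread), 8 CPUs (the box in comment 4),
# and an assumed 50 ms spent under the lock per process.
with_bkl = serialized_makespan(120, 50)
without_bkl = parallel_makespan(120, 50, 8)
```

Under these assumptions the lock-bound case takes 6000 ms against 750 ms without the shared lock, and adding CPUs does not shorten the lock-bound case at all.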

Comment 42 RHEL Program Management 2011-04-04 02:22:13 UTC

Since RHEL 6.1 External Beta has begun, and this bug remains
unresolved, it has been rejected as it is not proposed as
exception or blocker.

Red Hat invites you to ask your support representative to
propose this request, if appropriate and relevant, in the
next release of Red Hat Enterprise Linux.

Comment 52 Rob Evers 2011-04-21 20:09:32 UTC

(In reply to comment #37)
> Hi, Peter,
> 
> We have a customer system with a lot of storage attached.  During module load,
> the system load average climbs to over 130 and becomes very sluggish (see
> comment #34).  Most of the activity is related to setting up the sd devices
> using udev (it will call blkid for every partition, for example).  I'm not sure
> what to expect from the system when this happens.  It sounds like a case of,
> "don't do that."  However, I'd like an official story from a scheduler person
> on what the expected behavior is.
> 
> I'll attach the sysrq-t output.

I looked into this today.  It appears that the lpfc driver is responsible for the sluggishness of loading.

On a host with 100 LUNs configured, a direct comparison between rhel5.6 and rhel6.1-snap4 showed the rhel6 load taking on the order of 20 seconds, whereas the rhel5.6 load took on the order of 2 seconds.

Repeating the experiment with a qlogic adapter, the qlogic adapter loaded under both scenarios in a few seconds.

I have opened a new bug on this specific issue and will get Emulex involved directly:

https://bugzilla.redhat.com/show_bug.cgi?id=698799

Comment 54 James M. Leddy 2011-05-03 17:29:23 UTC

This regressed from RHEL 6.0. We have a workaround, but how did we get into a regression?

Comment 55 Jeff Moyer 2011-05-03 17:55:00 UTC

James,

This bugzilla has been dealing with a problem introduced in RHEL 6.0.  If you see a regression from 6.0 -> 6.1, then I think it's a different issue.  Could you please elaborate on what you have seen?

Comment 56 James M. Leddy 2011-05-03 18:12:38 UTC

Jeff,

I think I might be commenting in the wrong bug. Let me know.

If we're dealing with the specific issue of the lpfc module using the BKL, then I have to get on to another bug. I was speaking to the fact that the emc scsi device handler is loaded in 6.0 and is _not_ in 6.1, which causes all of the blocking on BKL.

Comment 57 Jeff Moyer 2011-05-03 18:23:24 UTC

Hi, James,

This specific bug addresses race conditions in smp_call_function that result in a hung system.  There are separate bugs to address the load order of the modules (scsi_dh_emc, multipath, lpfc) and to address the regression in driver load times for lpfc.

So, in short, I agree.  You are updating the wrong bug.  ;-)

Comment 58 James M. Leddy 2011-05-09 21:02:37 UTC

Okay, moved to bug 690523

Comment 61 Aristeu Rozanski 2011-05-20 20:40:15 UTC

Patch(es) available on kernel-2.6.32-151.el6

Comment 64 RHEL Program Management 2011-05-20 20:59:44 UTC

This request was evaluated by Red Hat Product Management for inclusion
in a Red Hat Enterprise Linux maintenance release. Product Management has 
requested further review of this request by Red Hat Engineering, for potential
inclusion in a Red Hat Enterprise Linux Update release for currently deployed 
products. This request is not yet committed for inclusion in an Update release.

Comment 66 Gris Ge 2011-06-14 01:16:08 UTC

We cannot find a server within Red Hat that has 128 CPUs (HT) and an HBA connected.

Can we find a partner or customer to test this bug?

Let me know if we do have hardware suitable for reproducing this bug. Thanks.

Comment 77 errata-xmlrpc 2011-12-06 12:36:48 UTC

Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2011-1530.html