Bug 649519 - Kernel regression causing "SOFT LOCKUP" while building md arrays.
Summary: Kernel regression causing "SOFT LOCKUP" while building md arrays.
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.5
Hardware: x86_64
OS: Linux
low
high
Target Milestone: ---
: ---
Assignee: Jes Sorensen
QA Contact: Red Hat Kernel QE team
URL:
Whiteboard:
Depends On:
Blocks:
TreeView+ depends on / blocked
 
Reported: 2010-11-03 21:16 UTC by Christopher Owen
Modified: 2012-06-04 06:59 UTC (History)
4 users (show)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2012-06-04 06:59:48 UTC
Target Upstream Version:
Embargoed:


Attachments (Terms of Use)
The script used to build md arrays, setup lvm, format with xfs, and mount. (11.71 KB, text/plain)
2010-11-03 21:18 UTC, Christopher Owen
no flags Details
logs with line wrapping fixed. (11.01 KB, text/plain)
2010-11-03 21:28 UTC, Christopher Owen
no flags Details
dmesg from the x4500 (71.08 KB, text/plain)
2010-11-03 21:47 UTC, Christopher Owen
no flags Details

Description Christopher Owen 2010-11-03 21:16:53 UTC
Description of problem:

Using a Sun Fire x4500 and building 8 soft raid5 arrays of 5 disks, placed into one large lvm group, formatted with xfs... 

Using the kernel version 2.6.18-194.17.4.el5 and using my build script (attached below) the kernel will print many "kernel: BUG SOFT LOCKUP" entries to /var/log/messages and the console.  Eventually the machine will stop responding.  I have waited overnight for the machine to return, but it does not.

Using the kernel version 2.6.18-194.8.1.el5 this does not occur.

Version-Release number of selected component (if applicable):


How reproducible:
every time using the attached script and the latest kernel.

Steps to Reproduce:
1.install rhel 5.5 to /dev/sdac and /dev/sdy (two raid1 md arrays, one 4g swap and one for / with remaining space using ext3)
2.install xfsprogs from ftp://oss.sgi.com/projects/xfs/cmd_srpms-oct_09/
3.run attached script
4.script completes
5.within hours, machine locks up before array build is complete.
  
Actual results:

lockup (no ping, ssh, console), arrays build never completes.

Expected results:

load to exceed 14, machine to become slow, rebuilds to finish.

Additional info:

Each disk is 1TB
This machine runs puppet every two hours.
This machine currently has two disks SMART is complaining about (not being used in arrays.)

example stack trace sent to /var/log/messages
Nov  2 05:14:56 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 23s! [md3_raid5:3583]
Nov  2 05:14:57 ra-nas2 kernel: CPU 3:
Nov  2 05:14:57 ra-nas2 kernel: Modules linked in: cpufreq_ondemand powernow_k8 freq_table bonding ipv6 xfrm_nalgo crypto_api ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables loop dm_mirror dm_multipath scsi_dh raid456 xor video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev i2c_amd8111 i2c_amd756 i2c_core sg k8_edac k8temp edac_mc amd_rng pcspkr hwmon serio_raw e1000 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage shpchp sata_mv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Nov  2 05:14:57 ra-nas2 kernel: Pid: 3583, comm: md3_raid5 Not tainted 2.6.18-194.17.4.el5 #1
Nov  2 05:14:57 ra-nas2 kernel: RIP: 0010:[<ffffffff80064b50>]  [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
Nov  2 05:14:57 ra-nas2 kernel: RSP: 0018:ffff8103fd42ddb8  EFLAGS: 00000287
Nov  2 05:14:57 ra-nas2 kernel: RAX: 0000000000000000 RBX: ffff8101ffda4800 RCX: 0000000000000000
Nov  2 05:14:57 ra-nas2 kernel: RDX: 0000000000000286 RSI: 0000000000000287 RDI: ffff8101ffda4850
Nov  2 05:14:57 ra-nas2 kernel: RBP: ffff8101ff1be528 R08: ffff8101ff1f31f0 R09: 00000001f97d2000
Nov  2 05:14:57 ra-nas2 kernel: R10: 00000000f97d1000 R11: ffff8101ff1f3200 R12: 0000000000000000
Nov  2 05:14:57 ra-nas2 kernel: R13: ffff8101ff1bc0e0 R14: 0000000000000220 R15: ffffffff880cbf30
Nov  2 05:14:57 ra-nas2 kernel: FS:  0000000041475940(0000) GS:ffff8102071b46c0(0000) knlGS:0000000000000000
Nov  2 05:14:57 ra-nas2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Nov  2 05:14:57 ra-nas2 kernel: CR2: 00002aba6e3affe0 CR3: 0000000000201000 CR4: 00000000000006e0
Nov  2 05:14:57 ra-nas2 kernel:
Nov  2 05:14:57 ra-nas2 kernel: Call Trace:
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff88084c70>] :scsi_mod:scsi_dispatch_cmd+0x27d/0x2ff
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff8808a174>] :scsi_mod:scsi_request_fn+0x2c1/0x390
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff8005a85b>] generic_unplug_device+0x22/0x32
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff88332063>] :raid456:unplug_slaves+0x63/0x99
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff8833735f>] :raid456:raid5d+0x175/0x180
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff8003acc1>] prepare_to_wait+0x34/0x61
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff8021b72e>] md_thread+0xf8/0x10e
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff800a09d4>] autoremove_wake_function+0x0/0x2e
Nov  2 05:14:57 ra-nas2 kernel:  [<ffffffff8021b636>] md_thread+0x0/0x10e
Nov  2 05:14:58 ra-nas2 kernel:  [<ffffffff8003290a>] kthread+0xfe/0x132
Nov  2 05:14:58 ra-nas2 kernel:  [<ffffffff8005dfb1>] child_rip+0xa/0x11
Nov  2 05:14:58 ra-nas2 kernel:  [<ffffffff800a07bc>] keventd_create_kthread+0x0/0xc4
libsmb/cliconnect.c:cli_session_setup_spnego(877)
Nov  2 05:14:58 ra-nas2 kernel:  [<ffffffff8003280c>] kthread+0x0/0x132
Nov  2 05:14:58 ra-nas2 kernel:  [<ffffffff8005dfa7>] child_rip+0x0/0x11

Second example stack trace:
Nov  2 05:14:53 ra-nas2 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [rpmq:5131]
Nov  2 05:14:53 ra-nas2 kernel: CPU 0:
Nov  2 05:14:53 ra-nas2 kernel: Modules linked in: cpufreq_ondemand powernow_k8 freq_table bonding ipv6 xfrm_nalgo crypto_api ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables loop dm_mirror dm_multipath scsi_dh raid456 xor video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev i2c_amd8111 i2c_amd756 i2c_core sg k8_edac k8temp edac_mc amd_rng pcspkr hwmon serio_raw e1000 dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage shpchp sata_mv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Nov  2 05:14:53 ra-nas2 kernel: Pid: 5131, comm: rpmq Not tainted 2.6.18-194.17.4.el5 #1
Nov  2 05:14:53 ra-nas2 kernel: RIP: 0010:[<ffffffff800161f5>]  [<ffffffff800161f5>] __bitmap_empty+0xf/0x62
Nov  2 05:14:53 ra-nas2 kernel: RSP: 0018:ffff8101ed23de18  EFLAGS: 00000212
Nov  2 05:14:53 ra-nas2 kernel: RAX: 0000000000000003 RBX: ffff810001003200 RCX: 00000000000000ff
Nov  2 05:14:53 ra-nas2 kernel: RDX: 000000000000003f RSI: 00000000000000ff RDI: ffff810001003200
Nov  2 05:14:53 ra-nas2 kernel: RBP: 0000000000001000 R08: 0000000000000002 R09: 000000000000003e
Nov  2 05:14:53 ra-nas2 kernel: R10: ffff8101ed23ddc8 R11: ffff8101ed37be48 R12: ffff8101ed23ddf8
Nov  2 05:14:53 ra-nas2 kernel: R13: 0000000000000000 R14: ffffffff8000c639 R15: ffff8101ed23de88
Nov  2 05:14:53 ra-nas2 kernel: FS:  00002b2336b01140(0000) GS:ffffffff803ca000(0000) knlGS:0000000000000000
Nov  2 05:14:53 ra-nas2 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
Nov  2 05:14:53 ra-nas2 kernel: CR2: 00002b2337390000 CR3: 00000001ecdec000 CR4: 00000000000006e0
Nov  2 05:14:53 ra-nas2 kernel:
Nov  2 05:14:53 ra-nas2 kernel: Call Trace:
Nov  2 05:14:53 ra-nas2 kernel:  [<ffffffff80023132>] flush_tlb_others+0x99/0xbc
Nov  2 05:14:53 ra-nas2 kernel:  [<ffffffff8007698f>] flush_tlb_mm+0xca/0xd5
Nov  2 05:14:53 ra-nas2 kernel:  [<ffffffff80012d8e>] unmap_region+0x104/0x12b
Nov  2 05:14:53 ra-nas2 kernel:  [<ffffffff80011e7f>] do_munmap+0x21b/0x29a
Nov  2 05:14:53 ra-nas2 kernel:  [<ffffffff800645ab>] __down_write_nested+0x12/0x92
Nov  2 05:14:53 ra-nas2 kernel:  [<ffffffff80016288>] sys_munmap+0x40/0x59
Nov  2 05:14:54 ra-nas2 kernel:  [<ffffffff8005d28d>] tracesys+0xd5/0xe0

Third example stack trace:
Nov  1 08:26:03 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 35s! [swapper:0]
Nov  1 08:26:03 ra-nas2 kernel: CPU 3:
Nov  1 08:26:03 ra-nas2 kernel: Modules linked in: xfs raid456 xor cpufreq_ondemand powernow_k8 freq_table ipt_REJECT xt_tcpudp xt_state iptable_filter iptable_nat ip_nat ip_conntrack nfnetlink ip_tables x_tables loop dm_mirror dm_multipath scsi_dh video backlight sbs power_meter i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport joydev e1000 i2c_amd756 k8temp sg i2c_amd8111 hwmon amd_rng i2c_core k8_edac pcspkr serio_raw edac_mc dm_raid45 dm_message dm_region_hash dm_log dm_mod dm_mem_cache usb_storage shpchp sata_mv libata sd_mod scsi_mod raid1 ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Nov  1 08:26:03 ra-nas2 kernel: Pid: 0, comm: swapper Not tainted 2.6.18-194.17.4.el5 #1
Nov  1 08:26:03 ra-nas2 kernel: RIP: 0010:[<ffffffff80064b50>]  [<ffffffff80064b50>] _spin_unlock_irqrestore+0x8/0x9
Nov  1 08:26:03 ra-nas2 kernel: RSP: 0018:ffff810107143d48  EFLAGS: 00000246
Nov  1 08:26:03 ra-nas2 kernel: RAX: 0000000000000000 RBX: ffff8101ff610800 RCX: 0000000000000001
Nov  1 08:26:03 ra-nas2 kernel: RDX: 0000000000000286 RSI: 0000000000000246 RDI: ffff8101ff610850
Nov  1 08:26:03 ra-nas2 kernel: RBP: ffff810107143cc0 R08: ffff810207204370 R09: ffff8101ff158000
Nov  1 08:26:03 ra-nas2 kernel: R10: ffff8103ec41d6c8 R11: ffffffff80045000 R12: ffffffff8005dc8e
Nov  1 08:26:03 ra-nas2 kernel: R13: ffff8101f4eaa9c0 R14: ffffffff800781ff R15: ffff810107143cc0
Nov  1 08:26:03 ra-nas2 kernel: FS:  00002b985e74d190(0000) GS:ffff8102071b46c0(0000) knlGS:00000000f7f8f6c0
Nov  1 08:26:03 ra-nas2 kernel: CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
Nov  1 08:26:03 ra-nas2 kernel: CR2: 00002aaaae528000 CR3: 00000003fa934000 CR4: 00000000000006e0
Nov  1 08:26:03 ra-nas2 kernel:
Nov  1 08:26:04 ra-nas2 kernel: Call Trace:
Nov  1 08:26:04 ra-nas2 kernel:  <IRQ>  [<ffffffff88084c70>] :scsi_mod:scsi_dispatch_cmd+0x27d/0x2ff
Nov  1 08:26:04 ra-nas2 kernel:  [<ffffffff8808a174>] :scsi_mod:scsi_request_fn+0x2c1/0x390
Nov  1 08:26:04 ra-nas2 kernel:  [<ffffffff8005c223>] blk_run_queue+0x41/0x73
Nov  1 08:26:05 ra-nas2 kernel:  [<ffffffff8808879d>] :scsi_mod:scsi_run_queue+0x155/0x1bf
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff88088efe>] :scsi_mod:scsi_next_command+0x2d/0x39
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff8808907d>] :scsi_mod:scsi_end_request+0xbf/0xcd
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff880891d9>] :scsi_mod:scsi_io_completion+0x14e/0x324
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff880b6802>] :sd_mod:sd_rw_intr+0x252/0x28c
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff8808946e>] :scsi_mod:scsi_device_unbusy+0x67/0x81
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff80037cac>] blk_done_softirq+0x5f/0x6d
Nov  1 08:26:06 ra-nas2 kernel:  [<ffffffff80012443>] __do_softirq+0x89/0x133
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8005e2fc>] call_softirq+0x1c/0x28
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8006cb8a>] do_softirq+0x2c/0x85
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8006ca12>] do_IRQ+0xec/0xf5
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8006b342>] default_idle+0x0/0x50
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8005d615>] ret_from_intr+0x0/0xa
Nov  1 08:26:07 ra-nas2 kernel:  <EOI>  [<ffffffff8005a839>] generic_unplug_device+0x0/0x32
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8006b36b>] default_idle+0x29/0x50
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8004923a>] cpu_idle+0x95/0xb8
Nov  1 08:26:07 ra-nas2 kernel:  [<ffffffff8007796b>] start_secondary+0x498/0x4a7


all lines containing "BUG" from /var/log/messages:
Nov  1 06:47:24 ra-nas2 kernel: BUG: soft lockup - CPU#1 stuck for 64s! [swapper:0]
Nov  1 08:26:03 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 35s! [swapper:0]
Nov  1 08:26:08 ra-nas2 kernel: BUG: soft lockup - CPU#2 stuck for 35s! [md6_resync:9859]
Nov  1 08:26:14 ra-nas2 kernel: BUG: soft lockup - CPU#1 stuck for 35s! [md9_resync:9898]
Nov  1 12:26:12 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 35s! [md5_raid5:9841]
Nov  1 12:26:14 ra-nas2 kernel: BUG: soft lockup - CPU#2 stuck for 35s! [md8_resync:9886]
Nov  1 12:26:16 ra-nas2 kernel: BUG: soft lockup - CPU#1 stuck for 35s! [md5_resync:9845]
Nov  2 05:14:53 ra-nas2 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [rpmq:5131]
Nov  2 05:14:54 ra-nas2 kernel: BUG: soft lockup - CPU#0 stuck for 10s! [rpmq:5131]
Nov  2 05:14:55 ra-nas2 kernel: BUG: soft lockup - CPU#1 stuck for 23s! [md9_resync:3628]
Nov  2 05:14:56 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 23s! [md3_raid5:3583]
Nov  2 05:14:58 ra-nas2 kernel: BUG: soft lockup - CPU#2 stuck for 23s! [md3_resync:3585]
Nov  2 13:14:41 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [md3_raid5:24146]
Nov  2 13:15:38 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [md3_raid5:24146]
Nov  2 13:15:40 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [md3_raid5:24146]
Nov  2 13:15:46 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [md3_raid5:24146]
Nov  2 13:15:47 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [md3_raid5:24146]
Nov  2 13:15:49 ra-nas2 kernel: BUG: soft lockup - CPU#3 stuck for 10s! [md3_raid5:24146]
Nov  2 13:15:54 ra-nas2 kernel: BUG: soft lockup - CPU#1 stuck for 65s! [smartd:5143]
Nov  2 13:15:55 ra-nas2 kernel: BUG: soft lockup - CPU#2 stuck for 65s! [md2_resync:24141]
Nov  3 06:34:31 ra-nas2 kernel: BUG: soft lockup - CPU#1 stuck for 64s! [swapper:0]

Comment 1 Christopher Owen 2010-11-03 21:18:13 UTC
Created attachment 457589 [details]
The script used to build md arrays, setup lvm, format with xfs, and mount.

Comment 2 Christopher Owen 2010-11-03 21:28:56 UTC
Created attachment 457590 [details]
logs with line wrapping fixed.

Comment 3 Christopher Owen 2010-11-03 21:30:17 UTC
Additional non-standard packages on this machine include ipmitool and hd from Sun's Driver and Tools disk for x4500 servers version 1.7.

Comment 4 Christopher Owen 2010-11-03 21:31:20 UTC
One last note: this machine will have to go back into production next week so my ability to test will be reduced by the end of this week.

Comment 5 Christopher Owen 2010-11-03 21:47:09 UTC
Created attachment 457596 [details]
dmesg from the x4500

Comment 6 Christopher Owen 2010-11-04 00:06:32 UTC
Sorry, previously stated 2.6.18-194.8.1.el5 also does not work.

I am now trying kernel-2.6.18-164.15.1.el5 (the last 5.4 kernel)

Comment 7 Christopher Owen 2010-11-04 21:07:01 UTC
kernel-2.6.18-164.15.1.el5 worked as expected.

Comment 8 Christopher Owen 2010-12-20 19:47:09 UTC
I would appreciate advice on how to proceed.  I'm a little stuck at the moment.

Comment 9 Jes Sorensen 2012-05-08 14:02:17 UTC
Is this still an issue for you?


Note You need to log in before you can comment on or make changes to this bug.