Bug 990806
Summary: BUG: soft lockup - CPU#0 stuck for 63s! [killall5:7385]

Product: Red Hat Enterprise Linux 6
Component: kernel
Version: 6.5
Status: CLOSED ERRATA
Severity: high
Priority: urgent
Reporter: Chao Yang <chayang>
Assignee: Richard Guy Briggs <rbriggs>
QA Contact: Red Hat Kernel QE team <kernel-qe>
CC: adaora.onyia, arubin, asanders, ccui, chorn, cww, dhoward, drjones, eparis, gbeshers, hhuang, jamorgan, jane.lv, jdonohue, jeff.burrell, juzhang, jvillalo, jwilleford, lcapitulino, lisa.mitchell, michen, msvoboda, mvadkert, qzhang, randerso, ruwang, rwright, sauchter, sforsber, shuang, sluo, stephan.wiesand, sumeet.keswani, tlavigne, vmware-gos-qa, vpalavarapu, wkfgktua, xiaolong.wang
Target Milestone: rc
Keywords: ZStream
Hardware: Unspecified
OS: Unspecified
Fixed In Version: kernel-2.6.32-422.el6
Doc Type: Bug Fix
Doc Text: When the Audit subsystem was under heavy load, it could loop infinitely in the audit_log_start() function instead of failing over to the error recovery code. This could cause soft lockups in the kernel. With this update, the timeout condition in the audit_log_start() function has been modified to properly fail over when necessary.
Last Closed: 2013-11-21 19:29:39 UTC
Type: Bug
Bug Blocks: 839486, 840898, 888441, 914776, 993793, 1006441, 1017898, 1017903, 1017905, 1045525
Comment 3
Luiz Capitulino
2013-08-24 14:38:53 UTC
(In reply to Luiz Capitulino from comment #3)
> How many physical CPUs do you have? Can you paste the contents of /proc/cpus
> from your host?

# lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Stepping:              5
CPU MHz:               2394.021
BogoMIPS:              4787.26
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15

# cat /proc/cpuinfo
...
processor       : 15
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU E5530 @ 2.40GHz
stepping        : 5
cpu MHz         : 2394.021
cache size      : 8192 KB
physical id     : 0
siblings        : 8
core id         : 3
cpu cores       : 4
apicid          : 7
initial apicid  : 7
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4787.26
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

> PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.

This bug can also be reproduced while installing a bare-metal system.

(In reply to chayang from comment #4)
> > PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.
> This bug can be reproduced while installing bare metal system.

But in comment 2 you said you were able to reproduce this with a VM, right?

Anyway, I was finally able to get it on a VM. It must be the same issue because I have the same backtrace and it only triggers on the first boot after installation.
According to the backtrace, it seems that kauditd is spinning on a spin lock. I'm going to debug this further...

(In reply to Luiz Capitulino from comment #5)
> (In reply to chayang from comment #4)
> > > PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.
> > This bug can be reproduced while installing bare metal system.
>
> But on comment 2 you said you were able to reproduce this with a VM, right?

Right, it is reproducible with both bare metal and a VM.

> Anyway, I was finally able to get it on a VM. It must be the same issue
> because I have the same backtrace and it only triggers in the first boot
> after installation.

Indeed.

> According to the backtrace, it seems that kauditd is spinning on a spin
> lock. I'm going to debug this further...

I've come a long way with this bz; here's the latest news.

First, I've found out what's happening. The audit code is busy-waiting in audit_log_start(), in the while loop. It spins indefinitely because sleep_time went negative, and sleep_time went negative because it waited far too long for the kauditd thread to consume pending SKBs. The busy-waiting generates the hang, and the hang turns into a soft lockup.

I haven't yet found out why kauditd stops consuming SKBs. Knowing that is the key to solving the problem. There are several possible reasons; I need to investigate.

The other important news is that I did manage to reproduce this against the latest upstream kernel, so the bug exists there too. I also wrote a simple workaround and posted it upstream for discussion:

https://lkml.org/lkml/2013/8/28/626

I don't expect it to be applied, because it doesn't actually fix the problem. Also, we still get a long pause where we'd otherwise get a soft lockup, but a total hang is avoided. We may consider applying the workaround to RHEL6.5 *iff* we get customers impacted by the issue.

I'll keep investigating.

I've found a relatively simple way to reproduce this bug:

1. Download the readahead-collector program and build it
2. Run it with: # readahead-collector -f
3. From another terminal do: # pkill -SIGSTOP readahead-collector
4. Keep using the system: run top -d1, vmstat -S 1, etc.
5. Eventually, you'll get the soft lockup

This allowed me to understand what's happening and post a possible fix:

http://marc.info/?l=linux-kernel&m=137818375024600&w=2

We also got a different proposal:

http://marc.info/?l=linux-kernel&m=137817994623832&w=2

The upstream discussion may still take some time. If you need a quick workaround for this issue, you can try disabling audit by appending "audit=0" to the kernel command line.

See bug 1004024, which is likely a dup of this bug. The reporter has identified two patches that may have introduced the regression:

[kernel] audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE (Oleg Nesterov) [982467 962976]
[kernel] audit: avoid negative sleep durations (Oleg Nesterov) [982467 962976]

As I'm testing this against the upstream kernel and against the RHEL6 kernel, and as I have more than one reproducer, I'm going to build a test matrix; otherwise I'll go crazy. Will post the results shortly and then we can discuss our options.

Here goes. The -reverts kernel contains the two patches mentioned in comment 9 reverted, and the -myfix kernel contains the fix mentioned in comment 8.

Kernel              read-collector test (comment 8)   RHEL6.5 install (original description)
2.6.32-412          soft lockup                       soft lockup
2.6.32-412-reverts  system hangs for 55s              system hangs for 55s
2.6.32-412-myfix    works                             works

I added the read-collector column just to show that it has the same behavior as the RHEL6.5 install test. This is what I see with the upstream kernel too, btw.

According to that table, reverting the commits mentioned in comment 9 doesn't completely fix the problem. I can think of only two reasons for the bug not being hit/reported before:

1. As regular usage doesn't always trigger the problem, people overlooked it when it happened (as the major symptom is a temporary hang, albeit a long one)
2. There's another change in audit/netlink/kernel/RHEL6 that made the problem more likely to happen

Note that even if reverting comment 9's commits fixed the problem, I wouldn't recommend reverting them, because I don't believe upstream is going to do that and they seem to fix real bugs. IMO, we should concentrate our efforts on getting a real fix merged upstream and then backport that.

I'll keep working to get the fix merged upstream, but as this is an audit issue, I'm going to reassign this to the audit team so that they can handle it in RHEL6 (backport, z-streams, etc.)

(In reply to Luiz Capitulino from comment #8)
> This allowed me to understand what's happening and post a possible fix:
>
> http://marc.info/?l=linux-kernel&m=137818375024600&w=2

https://lkml.org/lkml/2013/9/3/4

> We also got a different proposal:
>
> http://marc.info/?l=linux-kernel&m=137817994623832&w=2

https://lkml.org/lkml/2013/9/2/471

The cause of this bug is:

https://lkml.org/lkml/2013/1/3/394
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=82919919

Patchset posted upstream to lkml and linux-audit:

https://lkml.org/lkml/2013/9/18/453
https://www.redhat.com/archives/linux-audit/2013-September/msg00024.html

*** Bug 1011242 has been marked as a duplicate of this bug. ***

Since https://bugzilla.redhat.com/show_bug.cgi?id=1005943 is marked as a blocker for 6.5, and that bug needs the fix from this bug, I am proposing this as a blocker for 6.5 RC.

I'd agree that makes sense.

OK to add SGI on-site engineers to this bug?

Of course. This issue is known and fixed publicly upstream. There does not appear to be any customer-sensitive information in the bug report.

*** Bug 1008711 has been marked as a duplicate of this bug. ***

*** Bug 1005943 has been marked as a duplicate of this bug.
***

This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux release. Product Management has requested further review of this request by Red Hat Engineering, for potential inclusion in a Red Hat Enterprise Linux release for currently deployed products. This request is not yet committed for inclusion in a release.

Patch(es) available on kernel-2.6.32-422.el6

*** Bug 1005866 has been marked as a duplicate of this bug. ***

*** Bug 1017012 has been marked as a duplicate of this bug. ***

*** Bug 1018056 has been marked as a duplicate of this bug. ***

I am making this bug public as there are numerous dups. All public comments are free of private information.

Since the problem described in this bug report should be resolved in a recent advisory, it has been closed with a resolution of ERRATA. For information on the advisory, and where to find the updated files, follow the link below. If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1645.html

Hi,

I am also facing the same issue on one of our production servers. Below are more details. Do you also suggest I upgrade the kernel on my server?

OS: RHEL 6.3
Kernel version: 2.6.32-279.el6.x86_64

Regards,
Venkat Palavarapu

Hello, everyone

I am also facing a similar case, on RHEL 7 on ppc64. Below are more details. Can you advise me on how to resolve these symptoms? It only occurs during boot.

Red Hat Enterprise Linux Server release 7.0 (Maipo)
Kernel version: 3.10.0-123.el7.ppc64

[ 48.601422] BUG: soft lockup - CPU#40 stuck for 23s! [kworker/40:0:529]
[ 48.601440] Modules linked in: nx_crypto pseries_rng mlx4_core(+) tg3(+) ses enclosure ptp pps_core uinput binfmt_misc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common usb_storage ipr libata dm_mirror dm_region_hash dm_log dm_mod
[ 48.601472] CPU: 40 PID: 529 Comm: kworker/40:0 Not tainted 3.10.0-123.el7.ppc64 #1
[ 48.601482] Workqueue: events .work_for_cpu_fn
[ 48.601486] task: c000003e35b0d5d0 ti: c000003e359d8000 task.ti: c000003e359d8000
[ 48.601489] NIP: c000000000010280 LR: c000000000010280 CTR: 0000000000205c34
[ 48.601493] REGS: c000003e359db2f0 TRAP: 0901 Not tainted (3.10.0-123.el7.ppc64)
[ 48.601496] MSR: 8000000100009032 <SF,EE,ME,IR,DR,RI> CR: 24002024 XER: 2000000b
[ 48.601504] SOFTE: 1
[ 48.601506] CFAR: 000000000067aecc
[ 48.601508] GPR00: c00000000007e3c4 c000003e359db570 c000000001275448 0000000000000900
GPR04: 0000000000000000 0000000000000001 0000000000000000 0000000000000001
GPR08: 0000000000000000 000000000f0b5f80 0000000000004ec0 0000000000205c34
GPR12: 0000000000003438 c000000007d7a000
[ 48.601527] NIP [c000000000010280] .arch_local_irq_restore+0xf0/0x150
[ 48.601530] LR [c000000000010280] .arch_local_irq_restore+0xf0/0x150
[ 48.601533] PACATMSCRATCH [8000000100009032]
[ 48.601535] Call Trace:
[ 48.601538] [c000003e359db570] [5355425359535445] 0x5355425359535445 (unreliable)
[ 48.601544] [c000003e359db5e0] [c00000000007e3c4] .tce_setrange_multi_pSeriesLP+0x134/0x1b0
[ 48.601548] [c000003e359db6a0] [c000000000050ff8] .walk_system_ram_range+0xc8/0x120
[ 48.601552] [c000003e359db740] [c00000000007ea34] .enable_ddw+0x5e4/0x770
[ 48.601556] [c000003e359db8a0] [c00000000007fb18] .dma_set_mask_pSeriesLP+0x1e8/0x260
[ 48.601561] [c000003e359db940] [c000000000024004] .dma_set_mask+0x54/0x130
[ 48.601573] [c000003e359db9c0] [d000000063c86108] .__mlx4_init_one+0x168/0xeb0 [mlx4_core]
[ 48.601579] [c000003e359dba80] [c0000000004b3540] .local_pci_probe+0x60/0x130
[ 48.601582] [c000003e359dbb20] [c0000000000dbb20] .work_for_cpu_fn+0x30/0x50
[ 48.601586] [c000003e359dbba0] [c0000000000e0420] .process_one_work+0x1d0/0x680
[ 48.601590] [c000003e359dbc50] [c0000000000e0cbc] .worker_thread+0x3ec/0x500
[ 48.601594] [c000003e359dbd30] [c0000000000ebb98] .kthread+0xe8/0xf0
[ 48.601598] [c000003e359dbe30] [c00000000000a168] .ret_from_kernel_thread+0x5c/0x74
[ 48.601601] Instruction dump:
[ 48.601603] e9228120 e9290000 e9290010 792807e3 4082ff74 38600a00 4bffff6c 60420000
[ 48.601609] 7c0802a6 f8010010 f821ff91 4bff1d31 <60000000> 38210070 e8010010 7c0803a6

(In reply to wkfgktua from comment #43)
> Hello, everyone
>
> I am also facing the similar case on RHEL v7 on ppc64. Below are more
> details. Do you advice me to solve symptoms? it only occur during booting
> time.
>
> Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Kernel version : 3.10.0-123.el7.ppc64
>
> [ 48.601422] BUG: soft lockup - CPU#40 stuck for 23s! [kworker/40:0:529]
> [...]

This dump does not look related to this bug. Did you intend to add it to bz 1197000?

I wonder if someone can help me with this bug. We have a customer on RHEL 2.6.32-431.20.3.el6, hence why I see this bug (990806), which was fixed in 2.6.32-422.el6. Are we missing something, or has this regressed?

[92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599]
.....
.....
[92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64 #1 HP ProLiant DL380p Gen8
[92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>] _spin_lock+0x1e/0x30

(In reply to Sumeet Keswani from comment #45)
> I wonder if someone can help me with this bug..
>
> we have a customer on RHEL 2.6.32-431.20.3.el6 , hence why i do i see this
> bug (990806) which was fixed in 2.6.32-422.el6. Are we missing something or
> has this regressed?
>
> [92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599]
> .....
> .....
> [92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64
> #1 HP ProLiant DL380p Gen8
> [92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>]
> _spin_lock+0x1e/0x30

Very little information has been provided (at least the full bug dump is needed), but from what I can see this does not appear to be the same cause, since there was no spin_lock involved in the original case.

Here is some more information. The stack is different every time; here are two instances of it.

[92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599]
[92776.414485] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support hpilo hpwdt sg power_meter serio_raw tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[92776.414495] CPU 4
[92776.414496] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support hpilo hpwdt sg power_meter serio_raw tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[92776.414506]
[92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64 #1 HP ProLiant DL380p Gen8
[92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>] _spin_lock+0x1e/0x30
[92776.414510] RSP: 0018:ffff883f58841468 EFLAGS: 00000297
[92776.414511] RAX: 0000000000009606 RBX: ffff883f58841468 RCX: 0000000000000000
[92776.414512] RDX: 0000000000009605 RSI: ffff883fe40c8240 RDI: ffffffff81e28290
[92776.414513] RBP: ffffffff8100bb8e R08: ffff883fe40c8508 R09: 0000000000000000
[92776.414514] R10: 0000000000013560 R11: 0000000000000000 R12: 0000000000000000
[92776.414515] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
[92776.414516] FS: 00007fc35fb7b7a0(0000) GS:ffff88014c080000(0000) knlGS:0000000000000000
[92776.414518] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[92776.414518] CR2: 00007fc35fb96000 CR3: 0000003fdae3f000 CR4: 00000000001407e0
[92776.414519] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[92776.414520] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400

24265] [<ffffffff8152b735>] ? page_fault+0x25/0x30
[1109037.524267] [<ffffffff8152e37e>] ? do_page_fault+0x3e/0xa0
[1109037.524269] [<ffffffff8152b735>] ? page_fault+0x25/0x30
[1109037.528104] BUG: soft lockup - CPU#19 stuck for 67s! [vertica:45049]
[1109037.528105] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support serio_raw hpilo hpwdt sg power_meter tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[1109037.528116] CPU 19
[1109037.528117] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support serio_raw hpilo hpwdt sg power_meter tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
[1109037.528127]
[1109037.528128] Pid: 45049, comm: vertica Not tainted 2.6.32-431.20.3.el6.x86_64 #1 HP

(In reply to Sumeet Keswani from comment #47)
> Here is some more information.
> The stack is different every time, here are two instances of it.

Only the first backtrace is reliable. The rest could be caused by random scribbling on core. This might explain why each is very different.

> [92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599]
...
> [92776.414495] CPU 4
...
> [92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64
> #1 HP ProLiant DL380p Gen8
> [92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>]
> _spin_lock+0x1e/0x30
...
> 24265] [<ffffffff8152b735>] ? page_fault+0x25/0x30
> [1109037.524267] [<ffffffff8152e37e>] ? do_page_fault+0x3e/0xa0
> [1109037.524269] [<ffffffff8152b735>] ? page_fault+0x25/0x30
> [1109037.528104] BUG: soft lockup - CPU#19 stuck for 67s! [vertica:45049]
...
> [1109037.528116] CPU 19
...
> [1109037.528128] Pid: 45049, comm: vertica Not tainted
> 2.6.32-431.20.3.el6.x86_64 #1 HP

I don't see any evidence that this is the same bug. There is no mention of audit_log_start in the backtrace. Please file a new bz.

Thanks for looking at this. I will ask the customer to file a new BZ as needed. Thanks.