Bug 990806 - BUG: soft lockup - CPU#0 stuck for 63s! [killall5:7385]
Status: CLOSED ERRATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: kernel
Version: 6.5
Hardware: Unspecified
OS: Unspecified
Priority: urgent
Severity: high
Target Milestone: rc
Target Release: ---
Assigned To: Richard Guy Briggs
QA Contact: Red Hat Kernel QE team
Keywords: ZStream
Duplicates: 1005866 1005943 1008711 1011242 1017012 1018056
Blocks: 840898 993793 1045525 839486 888441 914776 1006441 1017898 1017903 1017905
 
Reported: 2013-07-31 23:42 EDT by Chao Yang
Modified: 2017-02-23 21:43 EST (History)
38 users

Fixed In Version: kernel-2.6.32-422.el6
Doc Type: Bug Fix
Doc Text:
When the Audit subsystem was under heavy load, it could loop infinitely in the audit_log_start() function instead of failing over to the error recovery code. This could cause soft lockups in the kernel. With this update, the timeout condition in the audit_log_start() function has been modified to properly fail over when necessary.
Last Closed: 2013-11-21 14:29:39 EST
Type: Bug




External Trackers:
Red Hat Knowledge Base (Solution) 502603

Comment 3 Luiz Capitulino 2013-08-24 10:38:53 EDT
How many physical CPUs do you have? Can you paste the contents of /proc/cpuinfo from your host?

PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.
Comment 4 Chao Yang 2013-08-25 21:50:29 EDT
(In reply to Luiz Capitulino from comment #3)
> How many physical CPUs do you have? Can you paste the contents of
> /proc/cpuinfo from your host?
> 
# lscpu 
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                16
On-line CPU(s) list:   0-15
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 26
Stepping:              5
CPU MHz:               2394.021
BogoMIPS:              4787.26
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              8192K
NUMA node0 CPU(s):     0,2,4,6,8,10,12,14
NUMA node1 CPU(s):     1,3,5,7,9,11,13,15

# cat /proc/cpuinfo
...
processor	: 15
vendor_id	: GenuineIntel
cpu family	: 6
model		: 26
model name	: Intel(R) Xeon(R) CPU           E5530  @ 2.40GHz
stepping	: 5
cpu MHz		: 2394.021
cache size	: 8192 KB
physical id	: 0
siblings	: 8
core id		: 3
cpu cores	: 4
apicid		: 7
initial apicid	: 7
fpu		: yes
fpu_exception	: yes
cpuid level	: 11
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm ida dts tpr_shadow vnmi flexpriority ept vpid
bogomips	: 4787.26
clflush size	: 64
cache_alignment	: 64
address sizes	: 40 bits physical, 48 bits virtual
power management:

> PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.
This bug can also be reproduced while installing on a bare-metal system.
Comment 5 Luiz Capitulino 2013-08-26 18:07:18 EDT
(In reply to chayang from comment #4)

> > PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.
> This bug can be reproduced while installing bare metal system.

But on comment 2 you said you were able to reproduce this with a VM, right?

Anyway, I was finally able to get it on a VM. It must be the same issue because I have the same backtrace and it only triggers in the first boot after installation.

According to the backtrace, it seems that kauditd is spinning on a spin lock. I'm going to debug this further...
Comment 6 Chao Yang 2013-08-26 21:40:00 EDT
(In reply to Luiz Capitulino from comment #5)
> (In reply to chayang from comment #4)
> 
> > > PS: Just downloaded a RHEL6.5 image, will try to reproduce soon.
> > This bug can be reproduced while installing bare metal system.
> 
> But on comment 2 you said you were able to reproduce this with a VM, right?
> 
Right, it is reproducible on both bare metal and a VM.

> Anyway, I was finally able to get it on a VM. It must be the same issue
> because I have the same backtrace and it only triggers in the first boot
> after installation.
Indeed.

> According to the backtrace, it seems that kauditd is spinning on a spin
> lock. I'm going to debug this further...
Comment 7 Luiz Capitulino 2013-08-28 23:30:31 EDT
I've come a long way with this bz; here's the latest news.

First, I've found out what's happening. The audit code is busy-waiting in the while loop in audit_log_start(). It spins indefinitely because sleep_time went negative, and sleep_time went negative because it waited far too long for the kauditd thread to consume pending SKBs. It's the busy-waiting that generates the hang, and the hang turns into a soft lockup.

I haven't found out yet why kauditd stops consuming SKBs. Knowing that is the key to solving the problem. There are some possible reasons for it; I need to investigate.
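The arithmetic described here can be sketched in a few lines. This is a hypothetical, simplified user-space model of the wait loop; the names mirror audit_log_start(), but it is not the actual kernel code:

```python
# Hypothetical model of the audit_log_start() wait loop described above;
# names mirror the kernel source, but the logic is illustrative only.

def sleep_time_left(timeout_start, jiffies, backlog_wait_time=60):
    """Remaining wait budget in ticks; goes negative once the caller
    has waited longer than backlog_wait_time in total."""
    return timeout_start + backlog_wait_time - jiffies

# kauditd has stalled, so the backlog never drains while time passes:
timeout_start = 1000   # tick counter when the loop was entered
jiffies = 1100         # 100 ticks later, the queue is still full

remaining = sleep_time_left(timeout_start, jiffies)
print(remaining)       # -40

if remaining > 0:
    pass  # the loop would sleep (schedule_timeout) and retry politely
else:
    pass  # no sleep and no bail-out: the loop iterates again at once,
          # and the watchdog eventually reports it as a soft lockup
```

With a positive budget the caller sleeps and retries; once the budget goes negative, neither sleeping nor failing over happens, which is the busy-wait described above.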

The other important news is that I did manage to reproduce this against the latest upstream kernel, so the bug exists there too.

I also wrote a simple workaround and posted it upstream for discussion:

https://lkml.org/lkml/2013/8/28/626

I don't expect it to be applied, because it doesn't actually fix the problem. Also, we still get a long pause where we'd otherwise get a soft lockup, but a total hang is avoided. We may consider applying the workaround to RHEL6.5 *iff* we get customers impacted by the issue.

I'll keep investigating.
Comment 8 Luiz Capitulino 2013-09-03 09:41:02 EDT
I've found a relatively simple way to reproduce this bug:

1. Download the readahead-collector program and build it
2. Run it with:
   # readahead-collector -f
3. From another terminal do:
   # pkill -SIGSTOP readahead-collector
4. Keep using the system, run top -d1, vmstat -S 1, etc
5. Eventually, you'll get the soft lockup

This allowed me to understand what's happening and post a possible fix:

http://marc.info/?l=linux-kernel&m=137818375024600&w=2

We also got a different proposal:

http://marc.info/?l=linux-kernel&m=137817994623832&w=2

The upstream discussion may still take some time; if you need a quick workaround for this issue, you can try disabling audit by appending "audit=0" to the kernel command line.
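For reference, on RHEL 6 (GRUB legacy) the parameter is appended to the kernel line in /boot/grub/grub.conf. The entry below is purely illustrative; the title, kernel version, root device, and initrd on a real system will differ:

```
# /boot/grub/grub.conf -- illustrative entry only; everything except
# the trailing "audit=0" comes from the existing configuration.
title Red Hat Enterprise Linux (2.6.32-412.el6.x86_64)
        root (hd0,0)
        kernel /vmlinuz-2.6.32-412.el6.x86_64 ro root=/dev/mapper/vg_root-lv_root rhgb quiet audit=0
        initrd /initramfs-2.6.32-412.el6.x86_64.img
```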
Comment 9 Andrew Jones 2013-09-04 04:41:07 EDT
See bug 1004024, which is likely a dup of this bug. The reporter has identified two patches that may have introduced the regression:

[kernel] audit: wait_for_auditd() should use TASK_UNINTERRUPTIBLE (Oleg Nesterov) [982467 962976]
[kernel] audit: avoid negative sleep durations (Oleg Nesterov) [982467 962976]
Comment 10 Luiz Capitulino 2013-09-04 10:56:03 EDT
As I'm testing this against both the upstream kernel and the RHEL6 kernel, and as I have more than one reproducer, I'm going to build a test matrix; otherwise I'll go crazy. I'll post the results shortly and then we can discuss our options.
Comment 11 Luiz Capitulino 2013-09-04 11:55:55 EDT
Here goes. The -reverts kernel has the two patches mentioned in comment 9 reverted, and the -myfix kernel contains the fix mentioned in comment 8.


Kernel                      readahead-collector test   RHEL6.5 install
                            (comment 8)                (original description)


2.6.32-412                  soft lockup                soft lockup
2.6.32-412-reverts          system hangs for 55s       system hangs for 55s
2.6.32-412-myfix            works                      works

I added the readahead-collector column just to show that it has the same behavior as the RHEL6.5 install test. This is what I see with the upstream kernel too, btw.

According to that table, reverting the commits mentioned in comment 9 doesn't completely fix the problem. I can think of only two reasons for the bug not being hit/reported before:

1. As regular usage doesn't always trigger the problem, people overlooked it when it happened (as the major symptom is a temporary hang, albeit a long one)

2. There's another change in the audit/netlink kernel code (upstream or RHEL6-specific) that made the problem more likely to happen

Note that even if reverting comment 9's commits fixed the problem, I wouldn't recommend reverting them, because I don't believe upstream is going to do that and they seem to fix real bugs.

IMO, we should concentrate our efforts on getting a real fix merged upstream and then backport that.
Comment 12 Luiz Capitulino 2013-09-05 09:21:28 EDT
I'll keep working to get the fix merged upstream, but as this is an audit issue, I'm going to reassign it to the audit team so that they can handle it in RHEL6 (backport, z-streams, etc.).
Comment 13 Richard Guy Briggs 2013-09-12 14:05:47 EDT
(In reply to Luiz Capitulino from comment #8)
> This allowed me to understand what's happening and post a possible fix:
> 
> http://marc.info/?l=linux-kernel&m=137818375024600&w=2

https://lkml.org/lkml/2013/9/3/4

> We also got a different proposal:
> 
> http://marc.info/?l=linux-kernel&m=137817994623832&w=2

https://lkml.org/lkml/2013/9/2/471

The cause of this bug is
https://lkml.org/lkml/2013/1/3/394
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=82919919
Comment 14 Richard Guy Briggs 2013-09-18 18:41:33 EDT
Patchset posted upstream to lkml and linux-audit:
    https://lkml.org/lkml/2013/9/18/453
    https://www.redhat.com/archives/linux-audit/2013-September/msg00024.html
Comment 15 Prarit Bhargava 2013-09-24 14:23:42 EDT
*** Bug 1011242 has been marked as a duplicate of this bug. ***
Comment 16 Suzanne Forsberg 2013-09-26 11:43:00 EDT
Since https://bugzilla.redhat.com/show_bug.cgi?id=1005943 is marked as a blocker for 6.5, and that bug needs the fix from this bug, I am proposing this as a blocker for 6.5 RC.
Comment 17 Richard Guy Briggs 2013-09-26 11:58:55 EDT
I'd agree that makes sense.
Comment 18 Joe Donohue 2013-09-30 10:04:42 EDT
OK to add SGI on-site engineers to this bug?
Comment 19 Eric Paris 2013-09-30 18:03:02 EDT
Of course.  This issue is known and fixed publicly upstream.  There does not appear to be any customer sensitive information in the bug report.
Comment 20 Eric Paris 2013-10-01 15:39:52 EDT
*** Bug 1008711 has been marked as a duplicate of this bug. ***
Comment 21 Eric Paris 2013-10-01 15:42:05 EDT
*** Bug 1005943 has been marked as a duplicate of this bug. ***
Comment 22 RHEL Product and Program Management 2013-10-01 16:15:19 EDT
This request was evaluated by Red Hat Product Management for
inclusion in a Red Hat Enterprise Linux release.  Product
Management has requested further review of this request by
Red Hat Engineering, for potential inclusion in a Red Hat
Enterprise Linux release for currently deployed products.
This request is not yet committed for inclusion in a release.
Comment 25 Rafael Aquini 2013-10-08 23:12:16 EDT
Patch(es) available on kernel-2.6.32-422.el6
Comment 31 Prarit Bhargava 2013-10-11 06:42:06 EDT
*** Bug 1005866 has been marked as a duplicate of this bug. ***
Comment 32 Prarit Bhargava 2013-10-11 06:55:12 EDT
*** Bug 1017012 has been marked as a duplicate of this bug. ***
Comment 33 Eric Paris 2013-10-11 09:50:07 EDT
*** Bug 1018056 has been marked as a duplicate of this bug. ***
Comment 34 Eric Paris 2013-10-11 14:15:07 EDT
I am making this bug public as there are numerous dups.  All public comments are free of private information.
Comment 40 errata-xmlrpc 2013-11-21 14:29:39 EST
Since the problem described in this bug report should be
resolved in a recent advisory, it has been closed with a
resolution of ERRATA.

For information on the advisory, and where to find the updated
files, follow the link below.

If the solution does not work for you, open a new bug report.

http://rhn.redhat.com/errata/RHSA-2013-1645.html
Comment 41 Venkat Palavarapu 2014-04-23 07:09:06 EDT
Hi,

I am also facing the same issue on one of our production servers. More details are below. Would you suggest that I upgrade the kernel on my server as well?

OS: RHEL 6.3
Kernel version: 2.6.32-279.el6.x86_64


Regards
Venkat Palavarapu
Comment 43 wkfgktua 2015-07-29 23:52:49 EDT
Hello, everyone

I am also facing a similar case, on RHEL 7 on ppc64. More details are below. Can you advise me how to resolve this? It only occurs during boot.


Red Hat Enterprise Linux Server release 7.0 (Maipo)
Kernel version : 3.10.0-123.el7.ppc64

[   48.601422] BUG: soft lockup - CPU#40 stuck for 23s! [kworker/40:0:529]
[   48.601440] Modules linked in: nx_crypto pseries_rng mlx4_core(+) tg3(+) ses enclosure ptp pps_core uinput binfmt_misc xfs libcrc32c sr_mod cdrom sd_mod crc_t10dif crct10dif_common usb_storage ipr libata dm_mirror dm_region_hash dm_log dm_mod
[   48.601472] CPU: 40 PID: 529 Comm: kworker/40:0 Not tainted 3.10.0-123.el7.ppc64 #1
[   48.601482] Workqueue: events .work_for_cpu_fn
[   48.601486] task: c000003e35b0d5d0 ti: c000003e359d8000 task.ti: c000003e359d8000
[   48.601489] NIP: c000000000010280 LR: c000000000010280 CTR: 0000000000205c34
[   48.601493] REGS: c000003e359db2f0 TRAP: 0901   Not tainted  (3.10.0-123.el7.ppc64)
[   48.601496] MSR: 8000000100009032 <SF,EE,ME,IR,DR,RI>  CR: 24002024  XER: 2000000b
[   48.601504] SOFTE: 1
[   48.601506] CFAR: 000000000067aecc
[   48.601508] 
GPR00: c00000000007e3c4 c000003e359db570 c000000001275448 0000000000000900 
GPR04: 0000000000000000 0000000000000001 0000000000000000 0000000000000001 
GPR08: 0000000000000000 000000000f0b5f80 0000000000004ec0 0000000000205c34 
GPR12: 0000000000003438 c000000007d7a000 
[   48.601527] NIP [c000000000010280] .arch_local_irq_restore+0xf0/0x150
[   48.601530] LR [c000000000010280] .arch_local_irq_restore+0xf0/0x150
[   48.601533] PACATMSCRATCH [8000000100009032]
[   48.601535] Call Trace:
[   48.601538] [c000003e359db570] [5355425359535445] 0x5355425359535445 (unreliable)
[   48.601544] [c000003e359db5e0] [c00000000007e3c4] .tce_setrange_multi_pSeriesLP+0x134/0x1b0
[   48.601548] [c000003e359db6a0] [c000000000050ff8] .walk_system_ram_range+0xc8/0x120
[   48.601552] [c000003e359db740] [c00000000007ea34] .enable_ddw+0x5e4/0x770
[   48.601556] [c000003e359db8a0] [c00000000007fb18] .dma_set_mask_pSeriesLP+0x1e8/0x260
[   48.601561] [c000003e359db940] [c000000000024004] .dma_set_mask+0x54/0x130
[   48.601573] [c000003e359db9c0] [d000000063c86108] .__mlx4_init_one+0x168/0xeb0 [mlx4_core]
[   48.601579] [c000003e359dba80] [c0000000004b3540] .local_pci_probe+0x60/0x130
[   48.601582] [c000003e359dbb20] [c0000000000dbb20] .work_for_cpu_fn+0x30/0x50
[   48.601586] [c000003e359dbba0] [c0000000000e0420] .process_one_work+0x1d0/0x680
[   48.601590] [c000003e359dbc50] [c0000000000e0cbc] .worker_thread+0x3ec/0x500
[   48.601594] [c000003e359dbd30] [c0000000000ebb98] .kthread+0xe8/0xf0
[   48.601598] [c000003e359dbe30] [c00000000000a168] .ret_from_kernel_thread+0x5c/0x74
[   48.601601] Instruction dump:
[   48.601603] e9228120 e9290000 e9290010 792807e3 4082ff74 38600a00 4bffff6c 60420000 
[   48.601609] 7c0802a6 f8010010 f821ff91 4bff1d31 <60000000> 38210070 e8010010 7c0803a6
Comment 44 Richard Guy Briggs 2015-07-31 05:46:45 EDT
(In reply to wkfgktua from comment #43)
> Hello, everyone
> 
> I am also facing the similar case on RHEL v7 on ppc64. Below are more
> details. Do you advice me to solve symptoms? it only occur during booting
> time.
> 
> 
> Red Hat Enterprise Linux Server release 7.0 (Maipo)
> Kernel version : 3.10.0-123.el7.ppc64
> 
> [ppc64 soft-lockup backtrace from comment 43 snipped]

This dump does not look related to this bug.  Did you intend to add it to bz 1197000?
Comment 45 Sumeet Keswani 2016-09-23 10:11:45 EDT
I wonder if someone can help me with this bug.

We have a customer on RHEL kernel 2.6.32-431.20.3.el6, so why do I still see this bug (990806), which was fixed in 2.6.32-422.el6? Are we missing something, or has this regressed?

[92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599] 
.....
.....
[92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64 #1 HP ProLiant DL380p Gen8 
[92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>] _spin_lock+0x1e/0x30
Comment 46 Richard Guy Briggs 2016-09-23 23:29:38 EDT
(In reply to Sumeet Keswani from comment #45)
> I wonder if someone can help me with this bug..
> 
> we have a customer on RHEL 2.6.32-431.20.3.el6 , hence why i do i see this
> bug (990806) which was fixed in 2.6.32-422.el6. Are we missing something or
> has this regressed?
> 
> [92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599] 
> .....
> .....
> [92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64
> #1 HP ProLiant DL380p Gen8 
> [92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>]
> _spin_lock+0x1e/0x30

Very little information has been provided (the full bug dump is needed at least), but from what I see this does not appear to be the same cause since there was no spin_lock involved in the original case.
Comment 47 Sumeet Keswani 2016-09-26 09:11:45 EDT
Here is some more information.
The stack is different every time; here are two instances of it.


[92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599] 
[92776.414485] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support hpilo hpwdt sg power_meter serio_raw tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
[92776.414495] CPU 4 
[92776.414496] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support hpilo hpwdt sg power_meter serio_raw tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
[92776.414506] 
[92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64 #1 HP ProLiant DL380p Gen8 
[92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>] _spin_lock+0x1e/0x30 
[92776.414510] RSP: 0018:ffff883f58841468 EFLAGS: 00000297 
[92776.414511] RAX: 0000000000009606 RBX: ffff883f58841468 RCX: 0000000000000000 
[92776.414512] RDX: 0000000000009605 RSI: ffff883fe40c8240 RDI: ffffffff81e28290 
[92776.414513] RBP: ffffffff8100bb8e R08: ffff883fe40c8508 R09: 0000000000000000 
[92776.414514] R10: 0000000000013560 R11: 0000000000000000 R12: 0000000000000000 
[92776.414515] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 
[92776.414516] FS: 00007fc35fb7b7a0(0000) GS:ffff88014c080000(0000) knlGS:0000000000000000 
[92776.414518] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 
[92776.414518] CR2: 00007fc35fb96000 CR3: 0000003fdae3f000 CR4: 00000000001407e0 
[92776.414519] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 
[92776.414520] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 


24265] [<ffffffff8152b735>] ? page_fault+0x25/0x30 
[1109037.524267] [<ffffffff8152e37e>] ? do_page_fault+0x3e/0xa0 
[1109037.524269] [<ffffffff8152b735>] ? page_fault+0x25/0x30 
[1109037.528104] BUG: soft lockup - CPU#19 stuck for 67s! [vertica:45049] 
[1109037.528105] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support serio_raw hpilo hpwdt sg power_meter tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
[1109037.528116] CPU 19 
[1109037.528117] Modules linked in: ipmi_watchdog ipmi_devintf nfs lockd fscache auth_rpcgss nfs_acl sunrpc autofs4 bonding 8021q garp stp llc ipv6 ext3 jbd microcode iTCO_wdt iTCO_vendor_support serio_raw hpilo hpwdt sg power_meter tg3 ptp pps_core be2net lpc_ich mfd_core shpchp ext4 jbd2 mbcache sd_mod crc_t10dif pata_acpi ata_generic ata_piix hpsa dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan] 
[1109037.528127] 
[1109037.528128] Pid: 45049, comm: vertica Not tainted 2.6.32-431.20.3.el6.x86_64 #1 HP
Comment 48 Richard Guy Briggs 2016-09-26 11:42:19 EDT
(In reply to Sumeet Keswani from comment #47)
> Here is some more information. 
> The stack is different every time, here are two instances of it.

Only the first backtrace is reliable.  The rest could be caused by random scribbling on core.  This might explain why each is very different.

> [92776.414484] BUG: soft lockup - CPU#4 stuck for 67s! [snmpd:2599] 
...
> [92776.414495] CPU 4 
...
> [92776.414507] Pid: 2599, comm: snmpd Not tainted 2.6.32-431.20.3.el6.x86_64
> #1 HP ProLiant DL380p Gen8 
> [92776.414508] RIP: 0010:[<ffffffff8152b22e>] [<ffffffff8152b22e>]
> _spin_lock+0x1e/0x30 
...
> 
> 24265] [<ffffffff8152b735>] ? page_fault+0x25/0x30 
> [1109037.524267] [<ffffffff8152e37e>] ? do_page_fault+0x3e/0xa0 
> [1109037.524269] [<ffffffff8152b735>] ? page_fault+0x25/0x30 
> [1109037.528104] BUG: soft lockup - CPU#19 stuck for 67s! [vertica:45049] 
...
> [1109037.528116] CPU 19 
...
> [1109037.528128] Pid: 45049, comm: vertica Not tainted
> 2.6.32-431.20.3.el6.x86_64 #1 HP

I don't see any evidence that this is the same bug.  There is no mention of audit_log_start in the backtrace.

Please file a new bz.
Comment 49 Sumeet Keswani 2016-09-26 11:44:51 EDT
Thanks for looking at this. I will ask the customer to file a new BZ as needed.
