Bug 231392 - LSPP: Misc soft-lockups in x86_64 lspp.67 kernel
Summary: LSPP: Misc soft-lockups in x86_64 lspp.67 kernel
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel
Version: 5.0
Hardware: All
OS: Linux
Priority: medium
Severity: high
Target Milestone: ---
Assignee: Eric Paris
QA Contact: Martin Jenner
URL:
Whiteboard:
Depends On:
Blocks: RHEL5LSPPCertTracker
 
Reported: 2007-03-07 23:32 UTC by Klaus Kiwi (Old account no longer used)
Modified: 2018-10-19 22:43 UTC
CC: 11 users

Fixed In Version: RHBA-2007-0959
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2007-11-07 19:43:12 UTC
Target Upstream Version:
Embargoed:


Attachments
soft lockup output (17.72 KB, application/octet-stream)
2007-04-10 16:54 UTC, Loulwa Salem
Patch to touch softlockup watchdog during policy load. (965 bytes, patch)
2007-04-12 16:45 UTC, Stephen Smalley
Profiler run with .75 kernel (48.24 KB, text/plain)
2007-04-18 22:30 UTC, Loulwa Salem


Links
Red Hat Product Errata RHBA-2007:0959 (normal, SHIPPED_LIVE): Updated kernel packages for Red Hat Enterprise Linux 5 Update 1 (last updated 2007-11-08 00:47:37 UTC)

Description Klaus Kiwi (Old account no longer used) 2007-03-07 23:32:14 UTC
Description of problem:
Misc soft lockup messages in dmesg when running lspp.67 kernel
uname -a:
Linux bracer2.ltc.austin.ibm.com 2.6.18-8.el5.lspp.67 #1 SMP Mon Feb 26 11:13:51
EST 2007 x86_64 x86_64 x86_64 GNU/Linux


Version-Release number of selected component (if applicable):
2.6.18-8.el5.lspp.67

How reproducible:
very

Steps to Reproduce:
1. load bnx2 driver
2. check dmesg and/or console

  
Actual results:
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
audit(1173271835.388:337): audit_pid=2291 old=0 by auid=4294967295
subj=system_u:system_r:auditd_t:s15:c0.c1023
eth0: no IPv6 routers present
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x5c/0xf3
BUG: soft lockup detected on CPU#3!

Call Trace:
 <IRQ>  [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
 [<ffffffff80097ae9>] update_process_times+0x42/0x68
 [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
 [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff800a5741>] lock_acquire+0x8d/0x9c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8006607f>] _spin_lock_bh+0x23/0x2c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8024b333>] ip_mc_inc_group+0x1ca/0x200
 [<ffffffff8024b3a7>] ip_mc_up+0x3e/0x57
 [<ffffffff8024771e>] inetdev_event+0x137/0x27d
 [<ffffffff8006877c>] notifier_call_chain+0x20/0x32
 [<ffffffff80216d7c>] dev_open+0x68/0x6e
 [<ffffffff80215488>] dev_change_flags+0x5a/0x119
 [<ffffffff80247fc6>] devinet_ioctl+0x235/0x59c
 [<ffffffff8020db86>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80042103>] do_ioctl+0x21/0x6b
 [<ffffffff800315b2>] vfs_ioctl+0x256/0x26f
 [<ffffffff8004c949>] sys_ioctl+0x59/0x78
 [<ffffffff8005ddda>] tracesys+0xd1/0xdb
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
BUG: soft lockup detected on CPU#2!

Call Trace:
 <IRQ>  [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
 [<ffffffff80097ae9>] update_process_times+0x42/0x68
 [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
 [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8003f029>] rt_run_flush+0x77/0xb0
 [<ffffffff800a5741>] lock_acquire+0x8d/0x9c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8006607f>] _spin_lock_bh+0x23/0x2c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8024b333>] ip_mc_inc_group+0x1ca/0x200
 [<ffffffff8024b3a7>] ip_mc_up+0x3e/0x57
 [<ffffffff802475d3>] inetdev_init+0x10b/0x11f
 [<ffffffff80248143>] devinet_ioctl+0x3b2/0x59c
 [<ffffffff8020db86>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80042103>] do_ioctl+0x21/0x6b
 [<ffffffff800315b2>] vfs_ioctl+0x256/0x26f
 [<ffffffff8004c949>] sys_ioctl+0x59/0x78
 [<ffffffff8005ddda>] tracesys+0xd1/0xdb

BUG: soft lockup detected on CPU#3!

Call Trace:
 <IRQ>  [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
 [<ffffffff80097ae9>] update_process_times+0x42/0x68
 [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
 [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
 <EOI>  [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff800a5741>] lock_acquire+0x8d/0x9c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8006607f>] _spin_lock_bh+0x23/0x2c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8024c6c4>] fib_disable_ip+0x20/0x29
 [<ffffffff8024ce63>] fib_inetaddr_event+0x19d/0x1b2
 [<ffffffff8006877c>] notifier_call_chain+0x20/0x32
 [<ffffffff8009a586>] blocking_notifier_call_chain+0x22/0x36
 [<ffffffff80247058>] inet_del_ifa+0x11f/0x1f8
 [<ffffffff802479e5>] inet_rtm_deladdr+0xe1/0x103
 [<ffffffff8021d155>] rtnetlink_rcv_msg+0x1b7/0x1dd
 [<ffffffff8022abff>] netlink_run_queue+0x84/0x147
 [<ffffffff8021cf9e>] rtnetlink_rcv_msg+0x0/0x1dd
 [<ffffffff8021cf49>] rtnetlink_rcv+0x30/0x4e
 [<ffffffff8022b035>] netlink_data_ready+0x12/0x50
 [<ffffffff8022a258>] netlink_sendskb+0x26/0x40
 [<ffffffff8022b010>] netlink_sendmsg+0x2b4/0x2c7
 [<ffffffff800554b7>] sock_sendmsg+0xf3/0x110
 [<ffffffff8009ff98>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a5638>] __lock_acquire+0x9af/0xa2b
 [<ffffffff8003ec1e>] sys_recvmsg+0x18f/0x254
 [<ffffffff800a4921>] trace_hardirqs_on+0x123/0x14e
 [<ffffffff8020e73f>] sys_sendto+0x11c/0x14f
 [<ffffffff800315f2>] __up_write+0x27/0xf2
 [<ffffffff8005ddda>] tracesys+0xd1/0xdb



Additional info:
Machine is an IBM Blade server, 2x Intel Xeon 5110  @ 1.60GHz

[root/sysadm_r/SystemLow@bracer2 ~]# lspci
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev 93)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port
2-3 (rev 93)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3
(rev 93)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 4
(rev 93)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5
(rev 93)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 6
(rev 93)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7
(rev 93)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 93)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 93)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 93)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers
(rev 93)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers
(rev 93)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 93)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 93)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express
Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB
Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB
Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB
Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2
Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface
Controller (rev 09)
00:1f.2 IDE interface: Intel Corporation 631xESB/632xESB/3100 Chipset SATA
Storage Controller IDE (rev 09)
01:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064E PCI-Express
Fusion-MPT SAS (rev 02)
03:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit
Ethernet (rev 12)
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit
Ethernet (rev 12)
07:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port
(rev 01)
07:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X
Bridge (rev 01)
09:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream
Port E1 (rev 01)
09:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream
Port E2 (rev 01)
[root/sysadm_r/SystemLow@bracer2 ~]#

Comment 1 Klaus Kiwi (Old account no longer used) 2007-03-07 23:39:15 UTC
May be totally unrelated, but I got another soft lockup on another blade server.
This one is a 2x dual-core AMD Opteron 2212 HE:

</var/log/messages>
Mar  7 11:15:46 oracer5 kernel: BUG: soft lockup detected on CPU#2!
Mar  7 11:15:46 oracer5 kernel:
Mar  7 11:15:46 oracer5 kernel: Call Trace:
Mar  7 11:15:46 oracer5 kernel:  <IRQ>  [<ffffffff800bc58a>]
softlockup_tick+0xdb/0xf6
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff80097ae9>] update_process_times+0x42/0x68
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff80077e11>]
smp_local_timer_interrupt+0x23/0x47
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff80078552>]
smp_apic_timer_interrupt+0x50/0x56
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
Mar  7 11:15:46 oracer5 kernel:  <EOI>  [<ffffffff80007894>]
kmem_cache_free+0x22c/0x239
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff8012c6eb>] avtab_destroy+0x2f/0x66
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff801320e1>]
cond_policydb_destroy+0x1c/0x29
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff8012d474>] policydb_destroy+0x102/0x1e0
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff80065f7b>] _write_unlock_irq+0x26/0x27
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff80130cd4>]
security_load_policy+0x233/0x28c
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff8005e1dc>] restore_args+0x0/0x30
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff800c872b>] zone_statistics+0x3e/0x6d
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff8012975f>] sel_write_load+0xbc/0x31b
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff80016dfa>] vfs_write+0xce/0x174
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff800176b6>] sys_write+0x45/0x6e
Mar  7 11:15:46 oracer5 kernel:  [<ffffffff8005ddda>] tracesys+0xd1/0xdb

And on this same system, loading the bnx2 module doesn't cause soft lockups,
but produces this instead:

</var/log/messages>
Mar  7 11:20:33 oracer5 kernel: bnx2: eth0: No interrupt was generated using
MSI, switching to INTx mode. Please report this failure to the PCI maintainer
and include system chipset information

<lspci>
[root/sysadm_r/SystemLow@oracer5 log]# lspci
00:01.0 PCI bridge: Broadcom HT1000 PCI/PCI-X bridge
00:02.0 Host bridge: Broadcom HT1000 Legacy South Bridge
00:02.2 ISA bridge: Broadcom HT1000 LPC Bridge
00:03.0 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:03.1 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:03.2 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:06.0 PCI bridge: Broadcom HT2000 PCI-X bridge (rev b0)
00:07.0 PCI bridge: Broadcom HT2000 PCI-X bridge (rev b0)
00:08.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:09.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:0a.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:0b.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM
Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron]
Miscellaneous Control
01:0d.0 PCI bridge: Broadcom HT1000 PCI/PCI-X bridge (rev c0)
02:04.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706S Gigabit
Ethernet (rev 02)
02:05.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706S Gigabit
Ethernet (rev 02)
03:04.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064 PCI-X
Fusion-MPT SAS (rev 02)
[root/sysadm_r/SystemLow@oracer5 log]#

Comment 2 Linda Knippers 2007-03-21 19:12:36 UTC
I booted the test kernel and it seems to run fine on my x86_64 target
system.  It didn't have the problem of hanging/pausing when starting udev
that I saw with the .69 kernel, and I'm not seeing any soft lockups.

Comment 3 Eric Paris 2007-03-21 19:15:12 UTC
In this case the 'test kernel' is actually 2.6.18-12, a devel kernel from after
GA, which 'should' have the soft-lockup detection turned on but includes none of
the LSPP patch changes.

Is your system doing any auditing?

Comment 4 Linda Knippers 2007-03-21 19:18:13 UTC
I have auditing enabled, although no audit rules loaded, same as when I
was attempting to run the .69 kernel.

With the .69 kernel I also had the soft lockup and hang/pause when
starting udev when I booted with selinux=0, in case that helps.
I eventually got an NMI watchdog panic.

Comment 5 Kylene J Hall 2007-03-21 20:03:38 UTC
2.6.18-12 also solved the issue that I was having with network_object tests.  I
also have auditing enabled but no actual rules loaded.

Comment 6 Eric Paris 2007-03-22 14:44:19 UTC
Since this does not exist in 2.6.18-12 or upstream 2.6.21-rc4, it must be in the
LSPP patches, which makes it my problem.  Drat.  Assigning to eparis.

Comment 7 Klaus Kiwi (Old account no longer used) 2007-03-22 15:51:54 UTC
This is blocking the LSPP eval effort, so I am raising it to high. Also adding
Iboverma to the CC list.

Comment 8 Eric Paris 2007-03-22 21:15:36 UTC
So I rebuilt 2.6.18-12 with the same .config I'm using in LSPP, and I see these
problems on my dl380g5.

I diffed the .config I'm using against the one in the -12 rpm that IBM and HP
tested here, and there are two differences:

--- configs/kernel-2.6.18-x86_64.config 2007-03-22 15:07:33.000000000 -0400
+++ .config     2007-03-22 15:09:05.000000000 -0400
@@ -2793,7 +2792,7 @@ CONFIG_DEBUG_RWSEMS=y
 CONFIG_DEBUG_LOCK_ALLOC=y
 CONFIG_PROVE_LOCKING=y
 CONFIG_LOCKDEP=y
-# CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_DEBUG_LOCKDEP=y
 CONFIG_TRACE_IRQFLAGS=y
 CONFIG_DEBUG_SPINLOCK_SLEEP=y
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
@@ -2803,7 +2802,7 @@ CONFIG_DEBUG_INFO=y
 CONFIG_DEBUG_FS=y
 CONFIG_DEBUG_VM=y
 CONFIG_DEBUG_LIST=y
-# CONFIG_FRAME_POINTER is not set
+CONFIG_FRAME_POINTER=y
 # CONFIG_FORCED_INLINING is not set
 CONFIG_BOOT_DELAY=y
 # CONFIG_RCU_TORTURE_TEST is not set


I'm going to try again with CONFIG_DEBUG_LOCKDEP turned off and see what
happens.  I'm also going to check what happened with the 2.6.21-rc4 kernel,
since I think I had lockdep on there too.

Comment 9 Eric Paris 2007-03-22 21:31:22 UTC
I dropped CONFIG_DEBUG_LOCKDEP, rebuilt 2.6.18-12, and it is working just fine.
Ugh; going to do the same on the LSPP kernel and then look into why.

Comment 10 Linda Knippers 2007-03-23 15:22:15 UTC
I'm running the lspp.70 kernel and my system is much happier now.
Thank you.

Comment 11 Klaus Kiwi (Old account no longer used) 2007-03-23 19:46:44 UTC
Everything seems fine here now too. Thanks; you may close the bug.

Comment 12 Kylene J Hall 2007-03-23 19:59:19 UTC
Works for me too.

Comment 13 Loulwa Salem 2007-03-26 22:22:51 UTC
As per the call earlier today, here is the soft lockup I saw.
This occurred on a ppc system with the latest everything, including the latest
policy-47 and the lspp.70 kernel.

I was trying to run a semanage command, which failed due to a context change. I
did a restorecon on the modules directory, and when I tried the semanage command
again I saw the lockup; the command succeeded, though.

[root/sysadm_r/SystemLow@joy-hv4 ~]# semanage login -m -s staff_u ealuser
libsemanage.semanage_commit_sandbox: Error while renaming
/etc/selinux/mls/modules/active to /etc/selinux/mls/modules/previous.
/usr/sbin/semanage: Could not modify login mapping for ealuser
[root/sysadm_r/SystemLow@joy-hv4 ~]# ls -Zd /etc/selinux/mls/modules/active/
drwx------  root root abat_u:object_r:selinux_config_t:SystemLow
/etc/selinux/mls/modules/active/
[root/sysadm_r/SystemLow@joy-hv4 ~]# restorecon -r /etc/selinux/mls/modules/
[root/sysadm_r/SystemLow@joy-hv4 ~]# ls -Zd /etc/selinux/mls/modules/active/
drwx------  root root system_u:object_r:semanage_store_t:SystemLow
/etc/selinux/mls/modules/active/
[root/sysadm_r/SystemLow@joy-hv4 ~]# semanage login -m -s staff_u ealuser
BUG: soft lockup detected on CPU#2!
Call Trace:
[C00000003EE9AF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000003EE9B000] [C0000000000A144C] .softlockup_tick+0xec/0x124
[C00000003EE9B0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000003EE9B120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000003EE9B200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C00000003EE9B5A0] [C000000000183934] .cond_av_list_destroy+0x20/0x44
[C00000003EE9B630] [C00000000018399C] .cond_node_destroy+0x44/0x70
[C00000003EE9B6C0] [C0000000001839E8] .cond_list_destroy+0x20/0x40
[C00000003EE9B750] [C00000000017CEE0] .policydb_destroy+0x144/0x280
[C00000003EE9B7F0] [C000000000182454] .security_load_policy+0x278/0x31c
[C00000003EE9BC10] [C00000000017767C] .sel_write_load+0xd0/0x340
[C00000003EE9BCF0] [C0000000000E203C] .vfs_write+0x118/0x200
[C00000003EE9BD90] [C0000000000E27AC] .sys_write+0x4c/0x8c
[C00000003EE9BE30] [C00000000000869C] syscall_exit+0x0/0x40

Comment 14 Eric Paris 2007-03-27 04:26:00 UTC
At this point only the data in comment #13 is of interest.

Is it reproducible?  If so, can you tell me exactly what hardware it is so I can
try to get it here in house?  The RHEL5 kernel is very close to upstream in this
area; in fact I can't think of any customizations to policy load.

Stephen, you usually have good thoughts offhand on this sort of stuff.

Comment 15 Stephen Smalley 2007-03-27 12:31:04 UTC
We only hold the policy wrlock for the small critical section where we are
installing the new policydb and sidtab into place, and we release it before
calling policydb_destroy on the old policydb, so I don't see any reason why
SELinux would cause such a soft lockup.  Given that the earlier soft lockups had
nothing to do with SELinux, it doesn't sound selinux-related.
I notice that the original bug report included the following:
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts


Comment 16 Klaus Kiwi (Old account no longer used) 2007-03-27 12:46:39 UTC
(In reply to comment #15)
> I notice that the original bug report included the following:
> warning: many lost ticks.
> Your time source seems to be instable or some driver is hogging interupts
> 

This was true for an x86_64 machine but not for a ppc64 machine, as far as I can
tell. I'll get back to the machine in comment #0 and see if the lost-ticks
problem still exists, but I don't think it is related to comment #13.


Comment 17 Loulwa Salem 2007-03-28 15:40:20 UTC
I have not been able to reproduce the soft lockup in comment #13 yet. I tried
the command on another system of the same type and didn't see it. I'll keep
trying to reproduce it, and if it happens again I'll report it right away.

Comment 18 George C. Wilson 2007-04-02 20:18:58 UTC
Mostly solved; one instance still seen. Eric and SDS don't understand why. The
CPU was slow getting back, not hung. This will fall by the wayside if nobody
can reproduce it.

Comment 19 George C. Wilson 2007-04-09 20:29:04 UTC
Still not recreated as of today.

Comment 20 Loulwa Salem 2007-04-10 16:54:36 UTC
Created attachment 152156 [details]
soft lockup output

Comment 21 Loulwa Salem 2007-04-10 18:17:31 UTC
Another soft lockup. I just put an attachment in comment #20. I saw this lockup
on ppc running the .71 kernel. I will update my kernel and see if I get it again.

I was running our tests overnight and saw the lockup this morning when I came in.

Comment 22 Stephen Smalley 2007-04-10 18:32:07 UTC
What are those tests doing, in particular what is triggering policy reloads?
And what other output appears in /var/log/messages at the time?
Not directly related per se, but semanage login shouldn't need to reload policy.


Comment 23 Loulwa Salem 2007-04-11 15:18:25 UTC
Yet another soft lockup. This also happened overnight while I left our test
suite running.
This is also on ppc, running the lspp.73 kernel and the latest everything (this
system was installed fresh yesterday).

We have tests that do policy reloads. I am running them now to try to catch the
soft lockup when it happens; I'll update this bug. Here is the lockup output:

[root/sysadm_r/SystemLow@joy-hv4 ~]# security:  invalidating context
syssecadm_u:sysadm_r:sysadm_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:secadm_r:secadm_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:object_r:sysadm_devpts_t:s0
security:  invalidating context syssecadm_u:sysadm_r:hostname_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:sysadm_r:consoletype_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:sysadm_r:newrole_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:syssecadm_r:sysadm_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:sysadm_r:system_chkpwd_t:s0-s15:c0.c1023
security:  invalidating context syssecadm_u:object_r:user_home_t:s0
security:  invalidating context syssecadm_u:object_r:secadm_devpts_t:s0
BUG: soft lockup detected on CPU#2!
Call Trace:
[C0000000439C6F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C0000000439C7000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000439C70A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C0000000439C7120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C0000000439C7200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C0000000439C75A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C0000000439C7630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C0000000439C76C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C0000000439C7750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000439C77F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C0000000439C7C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C0000000439C7CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C0000000439C7D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C0000000439C7E30] [C00000000000869C] syscall_exit+0x0/0x40
ip_tables: (C) 2000-2006 Netfilter Core Team
security:  invalidating context tmpuser100960_u:user_r:user_t:s0-s15:c0.c1023
security:  invalidating context tmpuser100960_u:user_r:user_t:s0
security:  invalidating context tmpuser100960_u:object_r:user_devpts_t:s0
security:  invalidating context tmpuser100960_u:user_r:passwd_t:s0
BUG: soft lockup detected on CPU#4!
Call Trace:
[C00000007838EFF0] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000007838F090] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C00000007838F130] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000007838F1B0] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000007838F290] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kmem_cache_free+0x2ac/0x2d4
    LR = .kmem_cache_free+0x290/0x2d4
[C00000007838F630] [C00000000017BF30] .avtab_destroy+0x58/0xb0
[C00000007838F6C0] [C0000000001844CC] .cond_policydb_destroy+0x30/0x4c
[C00000007838F750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C00000007838F7F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C00000007838FC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C00000007838FCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C00000007838FD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C00000007838FE30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#1!
Call Trace:
[C0000000772FAF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C0000000772FB000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000772FB0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C0000000772FB120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C0000000772FB200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C0000000772FB5A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C0000000772FB630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C0000000772FB6C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C0000000772FB750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000772FB7F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C0000000772FBC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C0000000772FBCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C0000000772FBD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C0000000772FBE30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#4!
Call Trace:
[C00000000754EF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000000754F000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C00000000754F0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000000754F120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000000754F200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C00000000754F5A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C00000000754F630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C00000000754F6C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C00000000754F750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C00000000754F7F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C00000000754FC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C00000000754FCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C00000000754FD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C00000000754FE30] [C00000000000869C] syscall_exit+0x0/0x40
security:  invalidating context tmpuser100961_u:user_r:user_t:s0-s15:c0.c1023
security:  invalidating context tmpuser100961_u:user_r:user_t:s8
security:  invalidating context tmpuser100961_u:object_r:user_devpts_t:s8
security:  invalidating context tmpuser100961_u:user_r:passwd_t:s8
BUG: soft lockup detected on CPU#0!
Call Trace:
[C00000006BA6AF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000006BA6B000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C00000006BA6B0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000006BA6B120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000006BA6B200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C00000006BA6B5A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C00000006BA6B630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C00000006BA6B6C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C00000006BA6B750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C00000006BA6B7F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C00000006BA6BC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C00000006BA6BCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C00000006BA6BD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C00000006BA6BE30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#0!
Call Trace:
[C000000083BE2F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000083BE3000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C000000083BE30A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C000000083BE3120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000083BE3200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C000000083BE35A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C000000083BE3630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C000000083BE36C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C000000083BE3750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C000000083BE37F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000083BE3C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000083BE3CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000083BE3D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000083BE3E30] [C00000000000869C] syscall_exit+0x0/0x40
security:  invalidating context tmpuser100962_u:user_r:user_t:s0-s15:c0.c1023
security:  invalidating context tmpuser100962_u:user_r:user_t:s15:c0.c1023
security:  invalidating context tmpuser100962_u:object_r:user_devpts_t:s15:c0.c1023
security:  invalidating context tmpuser100962_u:user_r:passwd_t:s15:c0.c1023
BUG: soft lockup detected on CPU#5!
Call Trace:
[C000000083BE2F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000083BE3000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C000000083BE30A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C000000083BE3120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000083BE3200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C000000083BE35A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C000000083BE3630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C000000083BE36C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C000000083BE3750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C000000083BE37F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000083BE3C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000083BE3CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000083BE3D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000083BE3E30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#5!
Call Trace:
[C00000006204B130] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000006204B1D0] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C00000006204B270] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000006204B2F0] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000006204B3D0] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kmem_cache_free+0x10/0x2d4
    LR = .avtab_destroy+0x58/0xb0
[C00000006204B750] [C00000000017D224] .policydb_destroy+0xb4/0x280
[C00000006204B7F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C00000006204BC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C00000006204BCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C00000006204BD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C00000006204BE30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#5!
Call Trace:
[C000000062043080] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000062043120] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000620431C0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C000000062043240] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000062043320] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kmem_cache_free+0x2ac/0x2d4
    LR = .kmem_cache_free+0x290/0x2d4
[C0000000620436C0] [C00000000017BF30] .avtab_destroy+0x58/0xb0
[C000000062043750] [C00000000017D224] .policydb_destroy+0xb4/0x280
[C0000000620437F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000062043C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000062043CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000062043D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000062043E30] [C00000000000869C] syscall_exit+0x0/0x40
INIT: version 2.86 reloading

Comment 24 Stephen Smalley 2007-04-11 15:43:31 UTC
So, looking at linux-kernel for similar kinds of reports, I see:
http://marc.info/?l=linux-kernel&m=117509781219133&w=2
That was for a "bogus" softlockup warning during sysrq-t due to it taking a long
time during the dump to console while holding the tasklist lock.

In our case, we are performing a fair amount of processing for the policy reload
itself (parsing the image, building up a new policydb, validating class and perm
values, cloning and converting the sidtab, tearing down the old policydb/sidtab)
plus printk output for invalidated contexts, while holding sel_mutex in
selinuxfs to synchronize with other loads or boolean accesses, load_mutex in
security_load_policy() to synchronize with other loads (hmmm...that seems
obsoleted by sel_mutex in the caller), and taking policy_wrlock and sidtab lock
for very short, fixed sequences.

sel_mutex and the seemingly obsolete load_mutex only interfere with other loads
or boolean ops, not other processing.

policy_wrlock and sidtab lock are only held for short, fixed durations in the
policy loading code, but could be forced to wait a bit for other threads to
release them.

Possibly the locking could be refactored, although most operations don't reach
the security server at all and just use the AVC and already computed SIDs.

Short term workaround might be to touch the softlockup watchdog in the policy
load code.  Or just shut it off for production.
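
For reference, here is a simplified sketch of the soft-lockup detector that this
suggestion works around, paraphrased from the 2.6.18-era kernel/softlockup.c
(not the exact RHEL5 source). Each CPU has a timestamp that a per-CPU watchdog
thread normally refreshes; the timer interrupt checks it in softlockup_tick(),
and touch_softlockup_watchdog() simply resets it, which is why touching the
watchdog from the policy load path makes the warning go away without changing
the amount of work being done:

#include <linux/jiffies.h>
#include <linux/kernel.h>
#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(unsigned long, touch_timestamp);

/* Called by long-running (but healthy) kernel code to say "not stuck". */
void touch_softlockup_watchdog(void)
{
	__get_cpu_var(touch_timestamp) = jiffies;
}

/* Called from the timer interrupt via update_process_times()/run_local_timers(). */
void softlockup_tick(void)
{
	unsigned long stamp = __get_cpu_var(touch_timestamp);

	/* The per-CPU watchdog kthread normally refreshes the stamp every
	 * second; if this CPU hasn't let it run for about 10 seconds, warn. */
	if (time_after(jiffies, stamp + 10 * HZ)) {
		printk(KERN_ERR "BUG: soft lockup detected on CPU#%d!\n",
		       smp_processor_id());
		dump_stack();
	}
}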



Comment 25 Stephen Smalley 2007-04-11 15:48:58 UTC
And just to clarify: does it ever truly lock up, or is it just taking a long
time to complete?

Comment 26 Loulwa Salem 2007-04-11 16:18:40 UTC
It does not actually lock up; it shows the soft lockup message on the screen,
but the test case continues to execute and completes.
Basically the test case is exercising the following:
  - semodule -R/i/u/r

I was able to run the test case individually and triggered the soft lockup once.
A second run didn't cause a lockup.
Looking in /var/log/messages, I don't see anything more than the soft lockup
info (as above) and the policy info, as in:

Apr 11 06:17:36 joy-hv4 kernel: security:  7 users, 8 roles, 1772 types, 95
bools, 16 sens, 1024 cats
Apr 11 06:17:36 joy-hv4 kernel: security:  59 classes, 148384 rules
Apr 11 06:17:46 joy-hv4 kernel: BUG: soft lockup detected on CPU#0!
Apr 11 06:17:46 joy-hv4 kernel: Call Trace:
Apr 11 06:17:46 joy-hv4 kernel: [C000000047D47080] [C0000000000109C0]
.show_stack+0x68/0x1b0 (unreliable)
Apr 11 06:17:46 joy-hv4 kernel: [C000000047D47120] [C0000000000A1820]
.softlockup_tick+0xec/0x124
...
....

Comment 27 Steve Grubb 2007-04-11 19:37:12 UTC
I think the soft lockups are the watchdog timer timing out. Since we keep
soft-lockup detection on for production kernels, I'd think the simplest
solution is to call the touch-watchdog function.

Comment 28 Stephen Smalley 2007-04-12 16:45:19 UTC
Created attachment 152482 [details]
Patch to touch softlockup watchdog during policy load.

This patch adds calls to touch the softlockup watchdog just before cloning the
SID table and just before tearing down the old policydb/sidtab (or new ones on
the error path).  See if this helps.
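
For reference, this is the shape of the change described above, as an
illustrative sketch only; the skeleton of security_load_policy() is abbreviated
and the helper steps are reduced to comments, so this is not the literal
contents of the attached patch:

int security_load_policy(void *data, size_t len)
{
	struct policydb oldpolicydb, newpolicydb;
	struct sidtab oldsidtab, newsidtab;

	/* ... read and validate the binary policy image into newpolicydb ... */

	touch_softlockup_watchdog();	/* added: just before cloning the SID table */
	sidtab_init(&newsidtab);
	/* ... clone the live sidtab into newsidtab and convert its contexts;
	       the error path frees the new copies, also after a watchdog touch ... */

	/* ... install newpolicydb/newsidtab under the policy write lock,
	       saving the old copies in oldpolicydb/oldsidtab ... */

	touch_softlockup_watchdog();	/* added: just before tearing down the old copies */
	policydb_destroy(&oldpolicydb);
	sidtab_destroy(&oldsidtab);

	return 0;
}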

Comment 29 Steve Grubb 2007-04-12 20:42:12 UTC
The patch was applied to the lspp.74 kernel. Can this please be re-tested? Thanks.

Comment 30 Loulwa Salem 2007-04-12 21:25:42 UTC
Thanks Steve. I just installed the .74 kernel, relabeled, and rebooted.
I tried my test and I still see a soft lockup.
The test still continues to execute and finishes. I ran the test twice and got
the lockup both times.

[root/sysadm_r/SystemLow@joy-hv4 ~]# BUG: soft lockup detected on CPU#4!
Call Trace:
[C0000000388BAF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C0000000388BB000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000388BB0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C0000000388BB120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C0000000388BB200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C0000000388BB5A0] [C000000000183D20] .cond_av_list_destroy+0x20/0x44
[C0000000388BB630] [C000000000183D88] .cond_node_destroy+0x44/0x70
[C0000000388BB6C0] [C000000000183DD4] .cond_list_destroy+0x20/0x40
[C0000000388BB750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000388BB7F0] [C000000000182868] .security_load_policy+0x288/0x334
[C0000000388BBC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C0000000388BBCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C0000000388BBD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C0000000388BBE30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#5!
Call Trace:
[C00000000FF2AF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FF2B000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C00000000FF2B0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000000FF2B120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000000FF2B200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C00000000FF2B5A0] [C000000000183D20] .cond_av_list_destroy+0x20/0x44
[C00000000FF2B630] [C000000000183D88] .cond_node_destroy+0x44/0x70
[C00000000FF2B6C0] [C000000000183DD4] .cond_list_destroy+0x20/0x40
[C00000000FF2B750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C00000000FF2B7F0] [C000000000182868] .security_load_policy+0x288/0x334
[C00000000FF2BC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C00000000FF2BCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C00000000FF2BD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C00000000FF2BE30] [C00000000000869C] syscall_exit+0x0/0x40

Comment 31 Stephen Smalley 2007-04-13 12:41:26 UTC
What is your hardware?
And you aren't running within a virtual machine, are you?

Comment 33 Loulwa Salem 2007-04-13 15:43:26 UTC
I am not running in a virtual machine. I am on a ppc system.
uname -a: 2.6.18-8.1.1.lspp.74.el5 #1 SMP Thu Apr 12 14:19:44 EDT 2007 ppc64
ppc64 ppc64 GNU/Linux

[root/abat_r/SystemLow@joy-hv4 LTP]# cat /proc/cpuinfo
processor       : 0
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 1
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 2
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 3
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 4
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 5
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 6
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

processor       : 7
cpu             : POWER5 (gr)
clock           : 1502.496000MHz
revision        : 2.2 (pvr 003a 0202)

timebase        : 188044000
platform        : pSeries
machine         : CHRP IBM,9124-720

Comment 34 Klaus Kiwi (Old account no longer used) 2007-04-13 22:49:19 UTC
This may be unrelated, but I thought I should report it here anyway.
This was copied from a 3270 terminal right after updating an s390x:


=============================================
[ INFO: possible recursive locking detected ]
2.6.18-8.1.1.lspp.75.el5 #1
---------------------------------------------
dasdfmt/1537 is trying to acquire lock:
 (&bdev->bd_mutex){--..}, at: [<000000000024cbe6>] mutex_lock+0x3e/0x4c

but task is already holding lock:
 (&bdev->bd_mutex){--..}, at: [<000000000024cbe6>] mutex_lock+0x3e/0x4c

other info that might help us debug this:
1 lock held by dasdfmt/1537:
 #0:  (&bdev->bd_mutex){--..}, at: [<000000000024cbe6>] mutex_lock+0x3e/0x4c

stack backtrace:
0000000079c1b560 0000000079c1b5f0 0000000000000002 0000000000000000
       0000000079c1b690 0000000079c1b608 0000000079c1b608 000000000001663a
       0000000000000000 0000000000000000 000000000000000a 000000000000000a
       0000000000000000 0000000079c1b5f0 0000000079c1b5f0 0000000079c1b668
       000000000025dd80 000000000001663a 0000000079c1b5f0 0000000079c1b648
Call Trace:
([<0000000000016586>] show_trace+0x122/0x13c)
 [<0000000000016658>] show_stack+0xb8/0xc8
 [<0000000000016696>] dump_stack+0x2e/0x3c
 [<000000000005e6ce>] __lock_acquire+0xab6/0xb18
 [<000000000005eec4>] lock_acquire+0x88/0xac
 [<000000000024c9bc>] __mutex_lock_slowpath+0x130/0x31c
 [<000000000024cbe6>] mutex_lock+0x3e/0x4c
 [<0000000000159930>] blkdev_ioctl+0x64c/0x7f0
 [<00000000000b55ec>] ioctl_by_bdev+0x74/0xbc
 [<000000008085c686>] dasd_destroy_partitions+0x66/0xc0 [dasd_mod]
 [<0000000080855c1a>] dasd_change_state+0x322/0x664 [dasd_mod]
 [<0000000080855fe2>] dasd_set_target_state+0x86/0x94 [dasd_mod]
 [<00000000808595d0>] dasd_ioctl+0x124/0x6c4 [dasd_mod]
 [<000000000015926e>] blkdev_driver_ioctl+0x8e/0xa8
 [<0000000000159a6e>] blkdev_ioctl+0x78a/0x7f0
 [<00000000000b5676>] block_ioctl+0x42/0x54
 [<00000000000c3da6>] do_ioctl+0x56/0xb8
 [<00000000000c413a>] vfs_ioctl+0x332/0x350
 [<00000000000c41d2>] sys_ioctl+0x7a/0xa0
 [<000000000001fcec>] sysc_tracego+0xe/0x14
 [<0000004f3b2f2c82>] 0x4f3b2f2c82



Comment 35 Steve Grubb 2007-04-14 12:14:42 UTC
The lspp.76 kernel was built in an attempt to solve this problem. It increases
the NMI watchdog timer value. Please give it a try and let us know whether the
problem is solved or still exists.  Thanks.

Comment 36 Klaus Kiwi (Old account no longer used) 2007-04-14 14:06:29 UTC
Is this kernel built with debugging symbols? Its size decreased from 67 MB to
just 16 MB.


Comment 37 Klaus Kiwi (Old account no longer used) 2007-04-16 13:51:39 UTC
I can't reproduce this anymore on my JS21.

As Steve requested, these are the 'time' values for load_policy when going:

===> from strict to mls:
[root/user_r/@js21racer1 ~]# time load_policy

real    0m1.666s
user    0m0.146s
sys     0m1.356s
[root/user_r/@js21racer1 ~]#

===> from mls to strict:
[root/user_r/@js21racer1 ~]# time load_policy
libsepol.sepol_genbools_array: boolean allow_httpd_auditadm_script_anon_write no
longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_secadm_script_anon_write no
longer in policy
libsepol.sepol_genbools_array: boolean allow_ipsec_label no longer in policy
libsepol.sepol_genbools_array: boolean allow_netlabel no longer in policy

real    0m1.430s
user    0m0.160s
sys     0m1.263s
[root/user_r/@js21racer1 ~]#

==> from strict to targeted:
[root/user_r/@js21racer1 ~]# time load_policy
libsepol.sepol_genbools_array: boolean allow_gpg_execstack no longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_staff_script_anon_write no
longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_sysadm_script_anon_write no
longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_user_script_anon_write no
longer in policy
libsepol.sepol_genbools_array: boolean allow_mplayer_execstack no longer in policy
libsepol.sepol_genbools_array: boolean allow_ptrace no longer in policy
libsepol.sepol_genbools_array: boolean allow_ssh_keysign no longer in policy
libsepol.sepol_genbools_array: boolean allow_user_mysql_connect no longer in policy
libsepol.sepol_genbools_array: boolean allow_write_xshm no longer in policy
libsepol.sepol_genbools_array: boolean cdrecord_read_content no longer in policy
libsepol.sepol_genbools_array: boolean cron_can_relabel no longer in policy
libsepol.sepol_genbools_array: boolean disable_evolution_trans no longer in policy
libsepol.sepol_genbools_array: boolean disable_games_trans no longer in policy
libsepol.sepol_genbools_array: boolean disable_mozilla_trans no longer in policy
libsepol.sepol_genbools_array: boolean disable_thunderbird_trans no longer in policy
libsepol.sepol_genbools_array: boolean mail_read_content no longer in policy
libsepol.sepol_genbools_array: boolean mozilla_read_content no longer in policy
libsepol.sepol_genbools_array: boolean pppd_for_user no longer in policy
libsepol.sepol_genbools_array: boolean read_untrusted_content no longer in policy
libsepol.sepol_genbools_array: boolean run_ssh_inetd no longer in policy
libsepol.sepol_genbools_array: boolean secure_mode no longer in policy
libsepol.sepol_genbools_array: boolean spamassassin_can_network no longer in policy
libsepol.sepol_genbools_array: boolean ssh_sysadm_login no longer in policy
libsepol.sepol_genbools_array: boolean staff_read_sysadm_file no longer in policy
libsepol.sepol_genbools_array: boolean user_direct_mouse no longer in policy
libsepol.sepol_genbools_array: boolean user_dmesg no longer in policy
libsepol.sepol_genbools_array: boolean user_net_control no longer in policy
libsepol.sepol_genbools_array: boolean user_ping no longer in policy
libsepol.sepol_genbools_array: boolean user_rw_noexattrfile no longer in policy
libsepol.sepol_genbools_array: boolean user_tcp_server no longer in policy
libsepol.sepol_genbools_array: boolean user_ttyfile_stat no longer in policy
libsepol.sepol_genbools_array: boolean write_untrusted_content no longer in policy
libsepol.sepol_genbools_array: boolean xdm_sysadm_login no longer in policy

real    0m1.496s
user    0m0.078s
sys     0m1.268s
[root/user_r/@js21racer1 ~]#

(No soft lockups seen on the console or in the messages log. This machine is a
ppc64 with 2x PPC970MP dual-core CPUs.)

Comment 38 George C. Wilson 2007-04-16 20:16:32 UTC
Loulwa, please retest this and comment.

Comment 39 Jon Masters 2007-04-16 20:20:52 UTC
It shouldn't be occurring anymore, since the time limit before the NMI watchdog
fires is now consistent with RHEL3/4 behavior, at 30 seconds. I spent some time
looking at this on Friday; please do CC me on any non-BZ discussion around this
bug.

Note that in a future RHEL5 update, the NMI watchdog will likely default to 30
seconds, and may in fact default to disabled on some platforms. Comments and
input on that kind of fix are, of course, very welcome.

Jon.


Comment 40 Eric Paris 2007-04-16 20:35:04 UTC
Sorry, I was out last week and the patches for this got a little off track.
Kernels .74 and .75 had a patch which should have prevented these messages from
appearing.

That patch was pulled in .76, and a patch to lengthen the NMI watchdog timeout
was added instead.  Although that is a good patch to have (for some hardware,
Core 2 Duo systems in particular), the NMI timer isn't the problem here.  The
soft lockup report is triggered by the normal timer interrupt, inside
softlockup_tick().

So there is nothing in .76 which should change the soft lockup detection
messages discussed in this bug.

I'm going to take a look this week to see why sel_load_policy takes 10+ seconds,
but just making the messages go away might be the way we go in the LSPP kernel.
Locking changes as discussed by sds might be the right way to go upstream, but
for the LSPP kernel something a bit quicker (and safer) will probably be the
right thing to do.

Any information anyone has on how to reproduce this would be a lot of help.  I've
heard it only happens on 8-way systems, and the test case is in comment #26.
What is included in the policy module in question could also be useful.

Testing whether it can be hit on .74/.75, and if not whether it comes back on
.76, could be interesting, but I'm not sure it will help me track down the
problem.

Comment 41 Loulwa Salem 2007-04-16 23:30:04 UTC
I am running on the latest kernel .76 with latest packages and policy. I still
see the soft lockup. Again I am running the same test case which basically does
a "semodule -R/u/i/r" as different users.

[root/sysadm_r/SystemLow@joy-hv4 ~]# BUG: soft lockup detected on CPU#5!
Call Trace:
[C000000046312F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000046313000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000463130A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C000000046313120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000046313200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0xe0/0x2d8
[C0000000463135A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C000000046313630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C0000000463136C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C000000046313750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000463137F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000046313C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000046313CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000046313D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000046313E30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#3!
Call Trace:
[C000000002316FF0] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000002317090] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C000000002317130] [C00000000007339C] .run_local_timers+0x1c/0x30
[C0000000023171B0] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000002317290] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kmem_cache_free+0x2ac/0x2d4
    LR = .kmem_cache_free+0xdc/0x2d4
[C000000002317630] [C00000000017BF30] .avtab_destroy+0x58/0xb0
[C0000000023176C0] [C0000000001844CC] .cond_policydb_destroy+0x30/0x4c
[C000000002317750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000023177F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000002317C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000002317CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000002317D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000002317E30] [C00000000000869C] syscall_exit+0x0/0x40

[root/sysadm_r/SystemLow@joy-hv4 ~]# uname -a
Linux joy-hv4.ltc.austin.ibm.com 2.6.18-8.1.1.lspp.76.el5 #1 SMP Fri Apr 13
16:12:34 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux

Please let me know what else I can do to help. I have tried this on a .74 kernel
and saw the soft lockup as well (as in comment #30), but I did not try the .75
kernel.

Comment 42 Eric Paris 2007-04-17 18:08:45 UTC
No surprise that .76 still shows the problem; there are no patches in .76 which
address this issue.  Comment #30 is interesting (thanks for pointing it out) as
it greatly shrinks the area of code where the delay is taking place.  I'm going
to look for an 8-way machine and start code diving to see what might take so
much time.

Comment 43 Eric Paris 2007-04-18 20:30:17 UTC
I got some hardware which might be close to yours and am going to start playing
with this myself.  How much memory does your system have?  With the knowledge in
comment #30 we can see that the 'problem' is in policydb_destroy(), which does
little other than walk lists and kfree memory.  I'm sure the problem is made
worse by the fact that we have slab debugging on, making kfree more expensive.

Would it be possible to get some oprofile data from you on this?  I'd prefer you
run the .75 kernel (not .76) since it has the softlockup-watchdog touch right
before policydb_destroy().  It might be a little harder to trigger, but will be
'better.'
I dropped the kernel and debuginfo on my people page:

http://people.redhat.com/eparis/LSPP-PPC-Debuginfo/

It should be as simple as installing the kernel and debuginfo packages, then:

opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/2.6.18-8.el5/vmlinux
opcontrol --start
  $DO_TEST
opcontrol --stop
opreport -l /usr/lib/debug/lib/modules/2.6.18-8.el5/vmlinux >  $OUTPUT_FILE

and then send me the contents of $OUTPUT_FILE.

Thanks,
Eric

Comment 44 Loulwa Salem 2007-04-18 22:30:40 UTC
Created attachment 152964 [details]
Profiler run with .75 kernel

Comment 45 Loulwa Salem 2007-04-18 22:35:30 UTC
Comment on attachment 152964 [details]
Profiler run with .75 kernel

I installed the lspp.75 kernel as requested above and ran my test case twice,
but was unable to get the soft lockup. However, I wanted to run the test case
with the profiler running anyway, and that's when I saw the soft lockup.

Attached is the profiler output.

Comment 46 Eric Paris 2007-04-19 17:35:23 UTC
I assume the soft lockup was in policydb_destroy again?

Comment 47 Loulwa Salem 2007-04-19 18:21:49 UTC
Yes, in policydb_destroy. The lockup is exactly the same as in the previous
posts, so I didn't want to post it again and again.

Comment 48 Eric Paris 2007-04-19 20:30:19 UTC
After some looking and some private discussions with others, I think I'm not
going to do anything with this bug.  The top two functions by time (by FAR) in
your test were:

121202   23.9879  .check_poison_obj
78247    15.4864  .check_slabp

accounting for 39% of the time.  Since debug functions like these won't be there
at all in the final LSPP kernel, I see this as a non-issue.  Let's leave this
open until the 'final' build comes along without any debug options.  Long term,
I'm going to put it on the list to look at the way we allocate/free memory for
the policydb, but I don't see any reason why the issue under discussion here is
problematic or troubling from either a functionality or certification point of
view.
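
For reference, check_poison_obj and check_slabp exist only with
CONFIG_DEBUG_SLAB: with poisoning enabled, freed objects are filled with a
poison pattern and re-scanned byte by byte on the next allocation, so an
allocation/free heavy path such as rebuilding and destroying a policydb with
~150k rules does far more memory traffic than it would on a production kernel.
A generic sketch of the kind of check involved (not the actual mm/slab.c code):

#include <string.h>

#define POISON_FREE 0x6b	/* byte pattern written into freed objects */
#define POISON_END  0xa5	/* marker for the last byte of the object */

/* Fill a freed object with the poison pattern (done on every free). */
static void poison_object(unsigned char *obj, size_t size)
{
	memset(obj, POISON_FREE, size - 1);
	obj[size - 1] = POISON_END;
}

/* Verify the poison is intact (done on the object's next allocation); any
 * mismatch means the object was written to after it was freed.  The scan is
 * O(object size) on every alloc/free cycle, which is where the 39% above goes. */
static int poison_intact(const unsigned char *obj, size_t size)
{
	size_t i;

	for (i = 0; i + 1 < size; i++)
		if (obj[i] != POISON_FREE)
			return 0;
	return obj[size - 1] == POISON_END;
}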

Comment 51 Steve Grubb 2007-04-24 18:18:56 UTC
The lspp.77 kernel was built with a new patch addressing this issue. Can someone
check it and see whether the problem has been solved? Thanks.

Comment 52 Steve Grubb 2007-04-25 20:37:20 UTC
It was reported on IRC that lspp.77 fixed the problem. Removing from tracker.

Comment 54 Don Zickus 2007-06-16 00:31:39 UTC
in 2.6.18-27.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

Comment 55 Klaus Kiwi (Old account no longer used) 2007-06-19 19:59:22 UTC
(In reply to comment #54)
> in 2.6.18-27.el5
> You can download this test kernel from http://people.redhat.com/dzickus/el5

Sorry, I lost track of what this fix is for. The issue was originally reported
against an LSPP kernel.

Is this fix specific to ppc64, or does it apply to more arches? Is the test case
(unload/load policy) still valid?

Thanks,

 Klaus K.

Comment 56 Eric Paris 2007-06-19 21:59:32 UTC
The fix was general, not ppc specific.  It appeared to be reproducible only on a
specific PPC machine during one particular IBM test suite, which was loading and
unloading an empty policy module.  I was never actually able to trigger the issue
on a similar machine here inside Red Hat.  It's going to be hard to say 'it is
fixed,' since you can't tell whether the 'fix' in that situation came from the
removal of the debug options or from the addition of voluntary preemption
points.

Comment 59 John Poelstra 2007-08-27 18:43:28 UTC
A fix for this issue should have been included in the packages contained in the
RHEL5.1-Snapshot3 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed as soon as possible to
ensure that it is included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

More assistance: If you cannot access bugzilla, please reply with a message to
Issue Tracker and I will change the status for you.  If you need assistance
accessing ftp://partners.redhat.com, please contact your Partner Manager.

Comment 60 John Poelstra 2007-08-31 00:26:20 UTC
A fix for this issue should have been included in the packages contained in the
RHEL5.1-Snapshot4 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed *as soon as possible*
to ensure that it is included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.


Comment 61 John Poelstra 2007-09-11 19:22:41 UTC
A fix for this issue should have been included in the packages contained in the
RHEL5.1-Snapshot6 on partners.redhat.com.  

Requested action: Please verify that your issue is fixed ASAP to confirm that it
will be included in this update release.

After you (Red Hat Partner) have verified that this issue has been addressed,
please perform the following:
1) Change the *status* of this bug to VERIFIED.
2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified)

If this issue is not fixed, please add a comment describing the most recent
symptoms of the problem you are having and change the status of the bug to FAILS_QA.

If you cannot access bugzilla, please reply with a message to Issue Tracker and
I will change the status for you.  If you need assistance accessing
ftp://partners.redhat.com, please contact your Partner Manager.

Comment 63 errata-xmlrpc 2007-11-07 19:43:12 UTC
An advisory has been issued which should help the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHBA-2007-0959.html


