Description of problem:
Misc soft lockup messages in dmesg when running the lspp.67 kernel.

uname -a: Linux bracer2.ltc.austin.ibm.com 2.6.18-8.el5.lspp.67 #1 SMP Mon Feb 26 11:13:51 EST 2007 x86_64 x86_64 x86_64 GNU/Linux

Version-Release number of selected component (if applicable):
2.6.18-8.el5.lspp.67

How reproducible:
Very.

Steps to Reproduce:
1. Load the bnx2 driver.
2. Check dmesg and/or the console.

Actual results:
bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
audit(1173271835.388:337): audit_pid=2291 old=0 by auid=4294967295 subj=system_u:system_r:auditd_t:s15:c0.c1023
eth0: no IPv6 routers present
warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
rip __do_softirq+0x5c/0xf3
BUG: soft lockup detected on CPU#3!

Call Trace:
 <IRQ> [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
 [<ffffffff80097ae9>] update_process_times+0x42/0x68
 [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
 [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
 <EOI> [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff800a5741>] lock_acquire+0x8d/0x9c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8006607f>] _spin_lock_bh+0x23/0x2c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8024b333>] ip_mc_inc_group+0x1ca/0x200
 [<ffffffff8024b3a7>] ip_mc_up+0x3e/0x57
 [<ffffffff8024771e>] inetdev_event+0x137/0x27d
 [<ffffffff8006877c>] notifier_call_chain+0x20/0x32
 [<ffffffff80216d7c>] dev_open+0x68/0x6e
 [<ffffffff80215488>] dev_change_flags+0x5a/0x119
 [<ffffffff80247fc6>] devinet_ioctl+0x235/0x59c
 [<ffffffff8020db86>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80042103>] do_ioctl+0x21/0x6b
 [<ffffffff800315b2>] vfs_ioctl+0x256/0x26f
 [<ffffffff8004c949>] sys_ioctl+0x59/0x78
 [<ffffffff8005ddda>] tracesys+0xd1/0xdb

bnx2: eth0: using MSI
ADDRCONF(NETDEV_UP): eth0: link is not ready
bnx2: eth0 NIC Link is Up, 1000 Mbps full duplex
ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
eth0: no IPv6 routers present
BUG: soft lockup detected on CPU#2!

Call Trace:
 <IRQ> [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
 [<ffffffff80097ae9>] update_process_times+0x42/0x68
 [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
 [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
 <EOI> [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8003f029>] rt_run_flush+0x77/0xb0
 [<ffffffff800a5741>] lock_acquire+0x8d/0x9c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8006607f>] _spin_lock_bh+0x23/0x2c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8024b333>] ip_mc_inc_group+0x1ca/0x200
 [<ffffffff8024b3a7>] ip_mc_up+0x3e/0x57
 [<ffffffff802475d3>] inetdev_init+0x10b/0x11f
 [<ffffffff80248143>] devinet_ioctl+0x3b2/0x59c
 [<ffffffff8020db86>] sock_ioctl+0x1c1/0x1e5
 [<ffffffff80042103>] do_ioctl+0x21/0x6b
 [<ffffffff800315b2>] vfs_ioctl+0x256/0x26f
 [<ffffffff8004c949>] sys_ioctl+0x59/0x78
 [<ffffffff8005ddda>] tracesys+0xd1/0xdb

BUG: soft lockup detected on CPU#3!
Call Trace:
 <IRQ> [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
 [<ffffffff80097ae9>] update_process_times+0x42/0x68
 [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
 [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
 [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
 <EOI> [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff800a5741>] lock_acquire+0x8d/0x9c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8006607f>] _spin_lock_bh+0x23/0x2c
 [<ffffffff8003f001>] rt_run_flush+0x4f/0xb0
 [<ffffffff8024c6c4>] fib_disable_ip+0x20/0x29
 [<ffffffff8024ce63>] fib_inetaddr_event+0x19d/0x1b2
 [<ffffffff8006877c>] notifier_call_chain+0x20/0x32
 [<ffffffff8009a586>] blocking_notifier_call_chain+0x22/0x36
 [<ffffffff80247058>] inet_del_ifa+0x11f/0x1f8
 [<ffffffff802479e5>] inet_rtm_deladdr+0xe1/0x103
 [<ffffffff8021d155>] rtnetlink_rcv_msg+0x1b7/0x1dd
 [<ffffffff8022abff>] netlink_run_queue+0x84/0x147
 [<ffffffff8021cf9e>] rtnetlink_rcv_msg+0x0/0x1dd
 [<ffffffff8021cf49>] rtnetlink_rcv+0x30/0x4e
 [<ffffffff8022b035>] netlink_data_ready+0x12/0x50
 [<ffffffff8022a258>] netlink_sendskb+0x26/0x40
 [<ffffffff8022b010>] netlink_sendmsg+0x2b4/0x2c7
 [<ffffffff800554b7>] sock_sendmsg+0xf3/0x110
 [<ffffffff8009ff98>] autoremove_wake_function+0x0/0x2e
 [<ffffffff800a5638>] __lock_acquire+0x9af/0xa2b
 [<ffffffff8003ec1e>] sys_recvmsg+0x18f/0x254
 [<ffffffff800a4921>] trace_hardirqs_on+0x123/0x14e
 [<ffffffff8020e73f>] sys_sendto+0x11c/0x14f
 [<ffffffff800315f2>] __up_write+0x27/0xf2
 [<ffffffff8005ddda>] tracesys+0xd1/0xdb

Additional info:
Machine is an IBM Blade server, 2x Intel Xeon 5110 @ 1.60GHz.

[root/sysadm_r/SystemLow@bracer2 ~]# lspci
00:00.0 Host bridge: Intel Corporation 5000P Chipset Memory Controller Hub (rev 93)
00:02.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x8 Port 2-3 (rev 93)
00:03.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 3 (rev 93)
00:04.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 4 (rev 93)
00:05.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 5 (rev 93)
00:06.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 6 (rev 93)
00:07.0 PCI bridge: Intel Corporation 5000 Series Chipset PCI Express x4 Port 7 (rev 93)
00:10.0 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 93)
00:10.1 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 93)
00:10.2 Host bridge: Intel Corporation 5000 Series Chipset FSB Registers (rev 93)
00:11.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 93)
00:13.0 Host bridge: Intel Corporation 5000 Series Chipset Reserved Registers (rev 93)
00:15.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 93)
00:16.0 Host bridge: Intel Corporation 5000 Series Chipset FBD Registers (rev 93)
00:1c.0 PCI bridge: Intel Corporation 631xESB/632xESB/3100 Chipset PCI Express Root Port 1 (rev 09)
00:1d.0 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #1 (rev 09)
00:1d.1 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #2 (rev 09)
00:1d.2 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset UHCI USB Controller #3 (rev 09)
00:1d.7 USB Controller: Intel Corporation 631xESB/632xESB/3100 Chipset EHCI USB2 Controller (rev 09)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev d9)
00:1f.0 ISA bridge: Intel Corporation 631xESB/632xESB/3100 Chipset LPC Interface Controller (rev 09)
00:1f.2 IDE interface: Intel Corporation 631xESB/632xESB/3100 Chipset SATA Storage Controller IDE (rev 09)
01:01.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
02:00.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064E PCI-Express Fusion-MPT SAS (rev 02)
03:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
04:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
05:00.0 PCI bridge: Broadcom EPB PCI-Express to PCI-X Bridge (rev c3)
06:00.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5708S Gigabit Ethernet (rev 12)
07:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Upstream Port (rev 01)
07:00.3 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express to PCI-X Bridge (rev 01)
09:00.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E1 (rev 01)
09:01.0 PCI bridge: Intel Corporation 6311ESB/6321ESB PCI Express Downstream Port E2 (rev 01)
[root/sysadm_r/SystemLow@bracer2 ~]#
May be totally unrelated, but I got another soft lockup in another blade server. This one is a 2x Dual-Core AMD Opteron 2212 HE:

</var/log/messages>
Mar 7 11:15:46 oracer5 kernel: BUG: soft lockup detected on CPU#2!
Mar 7 11:15:46 oracer5 kernel:
Mar 7 11:15:46 oracer5 kernel: Call Trace:
Mar 7 11:15:46 oracer5 kernel: <IRQ> [<ffffffff800bc58a>] softlockup_tick+0xdb/0xf6
Mar 7 11:15:46 oracer5 kernel: [<ffffffff80097ae9>] update_process_times+0x42/0x68
Mar 7 11:15:46 oracer5 kernel: [<ffffffff80077e11>] smp_local_timer_interrupt+0x23/0x47
Mar 7 11:15:46 oracer5 kernel: [<ffffffff80078552>] smp_apic_timer_interrupt+0x50/0x56
Mar 7 11:15:46 oracer5 kernel: [<ffffffff8005e86f>] apic_timer_interrupt+0x6b/0x70
Mar 7 11:15:46 oracer5 kernel: <EOI> [<ffffffff80007894>] kmem_cache_free+0x22c/0x239
Mar 7 11:15:46 oracer5 kernel: [<ffffffff8012c6eb>] avtab_destroy+0x2f/0x66
Mar 7 11:15:46 oracer5 kernel: [<ffffffff801320e1>] cond_policydb_destroy+0x1c/0x29
Mar 7 11:15:46 oracer5 kernel: [<ffffffff8012d474>] policydb_destroy+0x102/0x1e0
Mar 7 11:15:46 oracer5 kernel: [<ffffffff80065f7b>] _write_unlock_irq+0x26/0x27
Mar 7 11:15:46 oracer5 kernel: [<ffffffff80130cd4>] security_load_policy+0x233/0x28c
Mar 7 11:15:46 oracer5 kernel: [<ffffffff8005e1dc>] restore_args+0x0/0x30
Mar 7 11:15:46 oracer5 kernel: [<ffffffff800c872b>] zone_statistics+0x3e/0x6d
Mar 7 11:15:46 oracer5 kernel: [<ffffffff8012975f>] sel_write_load+0xbc/0x31b
Mar 7 11:15:46 oracer5 kernel: [<ffffffff80016dfa>] vfs_write+0xce/0x174
Mar 7 11:15:46 oracer5 kernel: [<ffffffff800176b6>] sys_write+0x45/0x6e
Mar 7 11:15:46 oracer5 kernel: [<ffffffff8005ddda>] tracesys+0xd1/0xdb

And on this same system, loading the bnx2 module doesn't cause soft lockups, but this instead:

</var/log/messages>
Mar 7 11:20:33 oracer5 kernel: bnx2: eth0: No interrupt was generated using MSI, switching to INTx mode.
Please report this failure to the PCI maintainer and include system chipset information

<lspci>
[root/sysadm_r/SystemLow@oracer5 log]# lspci
00:01.0 PCI bridge: Broadcom HT1000 PCI/PCI-X bridge
00:02.0 Host bridge: Broadcom HT1000 Legacy South Bridge
00:02.2 ISA bridge: Broadcom HT1000 LPC Bridge
00:03.0 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:03.1 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:03.2 USB Controller: Broadcom HT1000 USB Controller (rev 01)
00:05.0 VGA compatible controller: ATI Technologies Inc ES1000 (rev 02)
00:06.0 PCI bridge: Broadcom HT2000 PCI-X bridge (rev b0)
00:07.0 PCI bridge: Broadcom HT2000 PCI-X bridge (rev b0)
00:08.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:09.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:0a.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:0b.0 PCI bridge: Broadcom HT2000 PCI-Express bridge (rev b0)
00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
00:19.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
00:19.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
00:19.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
00:19.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
01:0d.0 PCI bridge: Broadcom HT1000 PCI/PCI-X bridge (rev c0)
02:04.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706S Gigabit Ethernet (rev 02)
02:05.0 Ethernet controller: Broadcom Corporation NetXtreme II BCM5706S Gigabit Ethernet (rev 02)
03:04.0 SCSI storage controller: LSI Logic / Symbios Logic SAS1064 PCI-X Fusion-MPT SAS (rev 02)
[root/sysadm_r/SystemLow@oracer5 log]#
I booted the test kernel and it boots and seems to run fine on my x86_64-targeted system. It didn't have the problem of hanging/pausing when starting udev that I saw with the .69 kernel, and I'm not seeing any soft lockups.
In this case the 'test kernel' is actually 2.6.18-12, a devel kernel post-GA which 'should' have the soft lockup detection turned on but includes none of the LSPP patch changes. Is your system doing any auditing?
I have auditing enabled although no audit rules loaded, same as when I was attempting to run the .69 kernel. With the .69 kernel I also had the soft lockup and hang/pause when starting udev when I booted with selinux=0, in case that helps. I eventually got an NMI watchdog panic.
2.6.18-12 also solved the issue that I was having with network_object tests. I also have auditing enabled but no actual rules loaded.
Since this does not exist in 2.6.18-12 or upstream 2.6.21-rc4, it must be in the LSPP patches, which makes it my problem. Drat. Assigning to eparis.
This is blocking the LSPP eval effort so raising it to high. Also adding Iboverma as cc
So I rebuilt 2.6.18-12 with the same .config I'm using in LSPP, and I see these problems on my dl380g5. I diffed the .config I'm using against the one in the -12 rpm which was tested here by IBM and HP, and it seems to have 2 differences:

--- configs/kernel-2.6.18-x86_64.config 2007-03-22 15:07:33.000000000 -0400
+++ .config 2007-03-22 15:09:05.000000000 -0400
@@ -2793,7 +2792,7 @@
 CONFIG_DEBUG_RWSEMS=y
 CONFIG_DEBUG_LOCK_ALLOC=y
 CONFIG_PROVE_LOCKING=y
 CONFIG_LOCKDEP=y
-# CONFIG_DEBUG_LOCKDEP is not set
+CONFIG_DEBUG_LOCKDEP=y
 CONFIG_TRACE_IRQFLAGS=y
 CONFIG_DEBUG_SPINLOCK_SLEEP=y
 # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
@@ -2803,7 +2802,7 @@
 CONFIG_DEBUG_INFO=y
 CONFIG_DEBUG_FS=y
 CONFIG_DEBUG_VM=y
 CONFIG_DEBUG_LIST=y
-# CONFIG_FRAME_POINTER is not set
+CONFIG_FRAME_POINTER=y
 # CONFIG_FORCED_INLINING is not set
 CONFIG_BOOT_DELAY=y
 # CONFIG_RCU_TORTURE_TEST is not set

I'm going to try again with LOCKDEP turned off and see what happens. I'm also going to check out what happened with the 2.6.21-rc4 kernel, since I think I had lockdep on there too....
I dropped CONFIG_DEBUG_LOCKDEP, rebuilt 2.6.18-12, and it is working just fine. Ugh. Going to go do the same on the LSPP kernel, then look into why.
I'm running the lspp.70 kernel and my system is much happier now. Thank you.
everything seems fine now here too, thanks and you may close the bug
Works for me too.
As per the call earlier today ... here is the soft lockup I saw. This occurred on a ppc system with the latest everything, including the latest policy-47 and kernel lspp.70.

I was trying to do a semanage command; semanage failed due to a context change. I did a restorecon on the modules directory, and when I tried the semanage command again I saw the lockup; the command succeeded though.

[root/sysadm_r/SystemLow@joy-hv4 ~]# semanage login -m -s staff_u ealuser
libsemanage.semanage_commit_sandbox: Error while renaming /etc/selinux/mls/modules/active to /etc/selinux/mls/modules/previous.
/usr/sbin/semanage: Could not modify login mapping for ealuser
[root/sysadm_r/SystemLow@joy-hv4 ~]# ls -Zd /etc/selinux/mls/modules/active/
drwx------ root root abat_u:object_r:selinux_config_t:SystemLow /etc/selinux/mls/modules/active/
[root/sysadm_r/SystemLow@joy-hv4 ~]# restorecon -r /etc/selinux/mls/modules/
[root/sysadm_r/SystemLow@joy-hv4 ~]# ls -Zd /etc/selinux/mls/modules/active/
drwx------ root root system_u:object_r:semanage_store_t:SystemLow /etc/selinux/mls/modules/active/
[root/sysadm_r/SystemLow@joy-hv4 ~]# semanage login -m -s staff_u ealuser
BUG: soft lockup detected on CPU#2!
Call Trace:
[C00000003EE9AF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000003EE9B000] [C0000000000A144C] .softlockup_tick+0xec/0x124
[C00000003EE9B0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000003EE9B120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000003EE9B200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8
    LR = .kfree+0x294/0x2d8
[C00000003EE9B5A0] [C000000000183934] .cond_av_list_destroy+0x20/0x44
[C00000003EE9B630] [C00000000018399C] .cond_node_destroy+0x44/0x70
[C00000003EE9B6C0] [C0000000001839E8] .cond_list_destroy+0x20/0x40
[C00000003EE9B750] [C00000000017CEE0] .policydb_destroy+0x144/0x280
[C00000003EE9B7F0] [C000000000182454] .security_load_policy+0x278/0x31c
[C00000003EE9BC10] [C00000000017767C] .sel_write_load+0xd0/0x340
[C00000003EE9BCF0] [C0000000000E203C] .vfs_write+0x118/0x200
[C00000003EE9BD90] [C0000000000E27AC] .sys_write+0x4c/0x8c
[C00000003EE9BE30] [C00000000000869C] syscall_exit+0x0/0x40
At this point only the data in comment #13 is of interest. Is it reproducible? If so, can you tell me exactly what the hardware is so I can try to get it here in house? The RHEL5 kernel is very close to upstream in this area; I can't, in fact, think of any customizations to policy load. Stephen, you usually have good thoughts right offhand on this sort of stuff....
We only hold the policy wrlock for the small critical section where we are installing the new policydb and sidtab into place, and we release it before calling policydb_destroy on the old policydb, so I don't see any reason why SELinux would cause such a soft lockup. Given that the earlier soft lockups had nothing to do with SELinux, it doesn't sound SELinux-related.

I notice that the original bug report included the following:

warning: many lost ticks.
Your time source seems to be instable or some driver is hogging interupts
(In reply to comment #15)
> I notice that the original bug report included the following:
> warning: many lost ticks.
> Your time source seems to be instable or some driver is hogging interupts

This was true for an x86_64 machine but not for a ppc64 machine, as far as I can tell. I'll get back to the machine in comment #0 and see if the lost ticks problem still exists, but I don't think it is related to comment #13.
I was not able to reproduce the soft lockup in comment #13 yet. I tried the command on another system of the same type and didn't see it ... I'll keep trying to reproduce it and if it happens again I'll report it right away.
Mostly solved, one still seen. Eric and SDS don't understand why. CPU was slow getting back, not hung. This will fall by the wayside if somebody cannot reproduce it.
Still not recreated as of today.
Created attachment 152156 [details] soft lockup output
Another soft lockup. I just put an attachment in comment #20. Saw this lockup on ppc running the .71 kernel. Will update my kernel and see if I get it again. I was running our tests overnight, and saw the lockup this morning when I came in.
What are those tests doing, particularly that is triggering policy reloads? And what other output happens in /var/log/messages at the time? Not directly related per se, but semanage login shouldn't need to reload policy.
Yet another soft lockup... This also happened overnight as I left our test suite running. This is also on ppc, running the lspp.73 kernel and latest everything (this system was installed fresh yesterday). We have tests that do policy reloads. I am running them now to maybe see the soft lockup when it happens. I'll update this bug. Here is the lockup output:

[root/sysadm_r/SystemLow@joy-hv4 ~]# security: invalidating context syssecadm_u:sysadm_r:sysadm_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:secadm_r:secadm_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:object_r:sysadm_devpts_t:s0
security: invalidating context syssecadm_u:sysadm_r:hostname_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:sysadm_r:consoletype_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:sysadm_r:newrole_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:syssecadm_r:sysadm_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:sysadm_r:system_chkpwd_t:s0-s15:c0.c1023
security: invalidating context syssecadm_u:object_r:user_home_t:s0
security: invalidating context syssecadm_u:object_r:secadm_devpts_t:s0
BUG: soft lockup detected on CPU#2!
Call Trace: [C0000000439C6F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C0000000439C7000] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C0000000439C70A0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C0000000439C7120] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C0000000439C7200] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kfree+0x2b0/0x2d8 LR = .kfree+0x294/0x2d8 [C0000000439C75A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44 [C0000000439C7630] [C000000000183D70] .cond_node_destroy+0x44/0x70 [C0000000439C76C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40 [C0000000439C7750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C0000000439C77F0] [C000000000182858] .security_load_policy+0x278/0x31c [C0000000439C7C10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C0000000439C7CF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C0000000439C7D90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C0000000439C7E30] [C00000000000869C] syscall_exit+0x0/0x40 ip_tables: (C) 2000-2006 Netfilter Core Team security: invalidating context tmpuser100960_u:user_r:user_t:s0-s15:c0.c1023 security: invalidating context tmpuser100960_u:user_r:user_t:s0 security: invalidating context tmpuser100960_u:object_r:user_devpts_t:s0 security: invalidating context tmpuser100960_u:user_r:passwd_t:s0 BUG: soft lockup detected on CPU#4! 
Call Trace: [C00000007838EFF0] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C00000007838F090] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C00000007838F130] [C00000000007339C] .run_local_timers+0x1c/0x30 [C00000007838F1B0] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C00000007838F290] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kmem_cache_free+0x2ac/0x2d4 LR = .kmem_cache_free+0x290/0x2d4 [C00000007838F630] [C00000000017BF30] .avtab_destroy+0x58/0xb0 [C00000007838F6C0] [C0000000001844CC] .cond_policydb_destroy+0x30/0x4c [C00000007838F750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C00000007838F7F0] [C000000000182858] .security_load_policy+0x278/0x31c [C00000007838FC10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C00000007838FCF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C00000007838FD90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C00000007838FE30] [C00000000000869C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#1! Call Trace: [C0000000772FAF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C0000000772FB000] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C0000000772FB0A0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C0000000772FB120] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C0000000772FB200] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kfree+0x2b0/0x2d8 LR = .kfree+0x294/0x2d8 [C0000000772FB5A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44 [C0000000772FB630] [C000000000183D70] .cond_node_destroy+0x44/0x70 [C0000000772FB6C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40 [C0000000772FB750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C0000000772FB7F0] [C000000000182858] .security_load_policy+0x278/0x31c [C0000000772FBC10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C0000000772FBCF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C0000000772FBD90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C0000000772FBE30] [C00000000000869C] syscall_exit+0x0/0x40 
BUG: soft lockup detected on CPU#4! Call Trace: [C00000000754EF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C00000000754F000] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C00000000754F0A0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C00000000754F120] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C00000000754F200] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kfree+0x2b0/0x2d8 LR = .kfree+0x294/0x2d8 [C00000000754F5A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44 [C00000000754F630] [C000000000183D70] .cond_node_destroy+0x44/0x70 [C00000000754F6C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40 [C00000000754F750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C00000000754F7F0] [C000000000182858] .security_load_policy+0x278/0x31c [C00000000754FC10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C00000000754FCF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C00000000754FD90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C00000000754FE30] [C00000000000869C] syscall_exit+0x0/0x40 security: invalidating context tmpuser100961_u:user_r:user_t:s0-s15:c0.c1023 security: invalidating context tmpuser100961_u:user_r:user_t:s8 security: invalidating context tmpuser100961_u:object_r:user_devpts_t:s8 security: invalidating context tmpuser100961_u:user_r:passwd_t:s8 BUG: soft lockup detected on CPU#0! 
Call Trace: [C00000006BA6AF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C00000006BA6B000] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C00000006BA6B0A0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C00000006BA6B120] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C00000006BA6B200] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kfree+0x2b0/0x2d8 LR = .kfree+0x294/0x2d8 [C00000006BA6B5A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44 [C00000006BA6B630] [C000000000183D70] .cond_node_destroy+0x44/0x70 [C00000006BA6B6C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40 [C00000006BA6B750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C00000006BA6B7F0] [C000000000182858] .security_load_policy+0x278/0x31c [C00000006BA6BC10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C00000006BA6BCF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C00000006BA6BD90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C00000006BA6BE30] [C00000000000869C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#0! 
Call Trace: [C000000083BE2F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C000000083BE3000] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C000000083BE30A0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C000000083BE3120] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C000000083BE3200] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kfree+0x2b0/0x2d8 LR = .kfree+0x294/0x2d8 [C000000083BE35A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44 [C000000083BE3630] [C000000000183D70] .cond_node_destroy+0x44/0x70 [C000000083BE36C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40 [C000000083BE3750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C000000083BE37F0] [C000000000182858] .security_load_policy+0x278/0x31c [C000000083BE3C10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C000000083BE3CF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C000000083BE3D90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C000000083BE3E30] [C00000000000869C] syscall_exit+0x0/0x40 security: invalidating context tmpuser100962_u:user_r:user_t:s0-s15:c0.c1023 security: invalidating context tmpuser100962_u:user_r:user_t:s15:c0.c1023 security: invalidating context tmpuser100962_u:object_r:user_devpts_t:s15:c0.c1023 security: invalidating context tmpuser100962_u:user_r:passwd_t:s15:c0.c1023 BUG: soft lockup detected on CPU#5! 
Call Trace: [C000000083BE2F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C000000083BE3000] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C000000083BE30A0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C000000083BE3120] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C000000083BE3200] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kfree+0x2b0/0x2d8 LR = .kfree+0x294/0x2d8 [C000000083BE35A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44 [C000000083BE3630] [C000000000183D70] .cond_node_destroy+0x44/0x70 [C000000083BE36C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40 [C000000083BE3750] [C00000000017D2B4] .policydb_destroy+0x144/0x280 [C000000083BE37F0] [C000000000182858] .security_load_policy+0x278/0x31c [C000000083BE3C10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C000000083BE3CF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C000000083BE3D90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C000000083BE3E30] [C00000000000869C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#5! Call Trace: [C00000006204B130] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C00000006204B1D0] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C00000006204B270] [C00000000007339C] .run_local_timers+0x1c/0x30 [C00000006204B2F0] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C00000006204B3D0] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kmem_cache_free+0x10/0x2d4 LR = .avtab_destroy+0x58/0xb0 [C00000006204B750] [C00000000017D224] .policydb_destroy+0xb4/0x280 [C00000006204B7F0] [C000000000182858] .security_load_policy+0x278/0x31c [C00000006204BC10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C00000006204BCF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C00000006204BD90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C00000006204BE30] [C00000000000869C] syscall_exit+0x0/0x40 BUG: soft lockup detected on CPU#5! 
Call Trace: [C000000062043080] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable) [C000000062043120] [C0000000000A1820] .softlockup_tick+0xec/0x124 [C0000000620431C0] [C00000000007339C] .run_local_timers+0x1c/0x30 [C000000062043240] [C000000000023CF8] .timer_interrupt+0xa8/0x464 [C000000062043320] [C0000000000034EC] decrementer_common+0xec/0x100 --- Exception: 901 at .kmem_cache_free+0x2ac/0x2d4 LR = .kmem_cache_free+0x290/0x2d4 [C0000000620436C0] [C00000000017BF30] .avtab_destroy+0x58/0xb0 [C000000062043750] [C00000000017D224] .policydb_destroy+0xb4/0x280 [C0000000620437F0] [C000000000182858] .security_load_policy+0x278/0x31c [C000000062043C10] [C000000000177A50] .sel_write_load+0xd0/0x340 [C000000062043CF0] [C0000000000E2410] .vfs_write+0x118/0x200 [C000000062043D90] [C0000000000E2B80] .sys_write+0x4c/0x8c [C000000062043E30] [C00000000000869C] syscall_exit+0x0/0x40 INIT: version 2.86 reloading
So, looking at linux-kernel for similar kinds of reports, I see: http://marc.info/?l=linux-kernel&m=117509781219133&w=2 That was for a "bogus" softlockup warning during sysrq-t due to it taking a long time during the dump to console while holding the tasklist lock. In our case, we are performing a fair amount of processing for the policy reload itself (parsing the image, building up a new policydb, validating class and perm values, cloning and converting the sidtab, tearing down the old policydb/sidtab) plus printk output for invalidated contexts, while holding sel_mutex in selinuxfs to synchronize with other loads or boolean accesses, load_mutex in security_load_policy() to synchronize with other loads (hmmm...that seems obsoleted by sel_mutex in the caller), and taking policy_wrlock and sidtab lock for very short, fixed sequences. sel_mutex and the seemingly obsolete load_mutex only interfere with other loads or boolean ops, not other processing. policy_wrlock and sidtab lock are only held for short, fixed durations in the policy loading code, but could be forced to wait a bit for other threads to release them. Possibly the locking could be refactored, although most operations don't reach the security server at all and just use the AVC and already computed SIDs. Short term workaround might be to touch the softlockup watchdog in the policy load code. Or just shut it off for production.
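The load path structure described above can be pictured roughly as follows. This is a simplified C-style sketch of the flow, not the literal RHEL5 source; the helper names and the swap step are abbreviations for illustration only:

```c
/* Simplified sketch -- not the literal kernel source. */
int security_load_policy(void *data, size_t len)
{
	struct policydb newpolicydb;
	struct sidtab newsidtab;

	mutex_lock(&load_mutex);              /* serialize with other loads      */

	policydb_read(&newpolicydb, data, len); /* long-running: parse the image,  */
	                                        /* build and validate new policydb */
	sidtab_init(&newsidtab);
	clone_and_convert_sidtab(&newsidtab);   /* long-running: remap cached SIDs */

	POLICY_WRLOCK;                        /* short, fixed critical section:  */
	swap(policydb, newpolicydb);          /* newpolicydb now holds the *old* */
	swap(sidtab, newsidtab);              /* policydb/sidtab                 */
	POLICY_WRUNLOCK;

	policydb_destroy(&newpolicydb);       /* long-running teardown of the    */
	sidtab_destroy(&newsidtab);           /* old structures, outside the     */
	                                      /* write lock                      */
	mutex_unlock(&load_mutex);
	return 0;
}
```

The point is that the write lock is held only across the swap; the expensive work, including the policydb_destroy() that shows up in the traces above, runs without the policy write lock held, which is why nothing else should actually be blocked while the watchdog fires.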
And just to clarify, does it truly lockup ever or it is just taking a long time to complete?
It does not actually lock up; it shows the soft lockup on the screen, but the test case continues to execute and completes. Basically the test case is exercising the following - semodule -R/i/u/r

I was able to run the test case individually and caused the soft lockup once. A second run didn't cause a lockup. Looking in /var/log/messages I don't see anything more than the soft lockup info (as above) and the policy info, as in:

Apr 11 06:17:36 joy-hv4 kernel: security: 7 users, 8 roles, 1772 types, 95 bools, 16 sens, 1024 cats
Apr 11 06:17:36 joy-hv4 kernel: security: 59 classes, 148384 rules
Apr 11 06:17:46 joy-hv4 kernel: BUG: soft lockup detected on CPU#0!
Apr 11 06:17:46 joy-hv4 kernel: Call Trace:
Apr 11 06:17:46 joy-hv4 kernel: [C000000047D47080] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
Apr 11 06:17:46 joy-hv4 kernel: [C000000047D47120] [C0000000000A1820] .softlockup_tick+0xec/0x124
...
I think the softlockups are the watchdog timer timing out. Since we keep softlockup detection on for production kernels, I'd think the simplest solution is to call the touch-watchdog function.
Created attachment 152482 [details] Patch to touch softlockup watchdog during policy load. This patch adds calls to touch the softlockup watchdog just before cloning the SID table and just before tearing down the old policydb/sidtab (or new ones on the error path). See if this helps.
Patch was applied to lspp.74 kernel. Can this please be re-tested? Thanks.
Thanks Steve. I just installed .74, relabeled and rebooted. I tried my test and I still see a soft lockup; the test still continues to execute and finishes. I ran the test twice and got the lockup both times:

[root/sysadm_r/SystemLow@joy-hv4 ~]# BUG: soft lockup detected on CPU#4!
Call Trace:
[C0000000388BAF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C0000000388BB000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000388BB0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C0000000388BB120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C0000000388BB200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8    LR = .kfree+0x294/0x2d8
[C0000000388BB5A0] [C000000000183D20] .cond_av_list_destroy+0x20/0x44
[C0000000388BB630] [C000000000183D88] .cond_node_destroy+0x44/0x70
[C0000000388BB6C0] [C000000000183DD4] .cond_list_destroy+0x20/0x40
[C0000000388BB750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000388BB7F0] [C000000000182868] .security_load_policy+0x288/0x334
[C0000000388BBC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C0000000388BBCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C0000000388BBD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C0000000388BBE30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#5!
Call Trace:
[C00000000FF2AF60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C00000000FF2B000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C00000000FF2B0A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C00000000FF2B120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C00000000FF2B200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8    LR = .kfree+0x294/0x2d8
[C00000000FF2B5A0] [C000000000183D20] .cond_av_list_destroy+0x20/0x44
[C00000000FF2B630] [C000000000183D88] .cond_node_destroy+0x44/0x70
[C00000000FF2B6C0] [C000000000183DD4] .cond_list_destroy+0x20/0x40
[C00000000FF2B750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C00000000FF2B7F0] [C000000000182868] .security_load_policy+0x288/0x334
[C00000000FF2BC10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C00000000FF2BCF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C00000000FF2BD90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C00000000FF2BE30] [C00000000000869C] syscall_exit+0x0/0x40
What is your hardware? And you aren't running within a virtual machine, are you?
http://marc.info/?l=linux-kernel&m=117503459125682&w=2 http://marc.info/?l=linux-kernel&m=117627205621964&w=2
I am not running a virtual machine; I am on a ppc system.

uname -a ==> 2.6.18-8.1.1.lspp.74.el5 #1 SMP Thu Apr 12 14:19:44 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux

[root/abat_r/SystemLow@joy-hv4 LTP]# cat /proc/cpuinfo
processor : 0
cpu : POWER5 (gr)
clock : 1502.496000MHz
revision : 2.2 (pvr 003a 0202)
(processors 1 through 7 are identical)
timebase : 188044000
platform : pSeries
machine : CHRP IBM,9124-720
May be unrelated, but I thought I should report it here anyway. This was copied from a 3270 terminal right after updating an s390x:

=============================================
[ INFO: possible recursive locking detected ]
2.6.18-8.1.1.lspp.75.el5 #1
---------------------------------------------
dasdfmt/1537 is trying to acquire lock:
 (&bdev->bd_mutex){--..}, at: [<000000000024cbe6>] mutex_lock+0x3e/0x4c
but task is already holding lock:
 (&bdev->bd_mutex){--..}, at: [<000000000024cbe6>] mutex_lock+0x3e/0x4c

other info that might help us debug this:
1 lock held by dasdfmt/1537:
 #0: (&bdev->bd_mutex){--..}, at: [<000000000024cbe6>] mutex_lock+0x3e/0x4c

stack backtrace:
0000000079c1b560 0000000079c1b5f0 0000000000000002 0000000000000000
0000000079c1b690 0000000079c1b608 0000000079c1b608 000000000001663a
0000000000000000 0000000000000000 000000000000000a 000000000000000a
0000000000000000 0000000079c1b5f0 0000000079c1b5f0 0000000079c1b668
000000000025dd80 000000000001663a 0000000079c1b5f0 0000000079c1b648
Call Trace:
([<0000000000016586>] show_trace+0x122/0x13c)
[<0000000000016658>] show_stack+0xb8/0xc8
[<0000000000016696>] dump_stack+0x2e/0x3c
[<000000000005e6ce>] __lock_acquire+0xab6/0xb18
[<000000000005eec4>] lock_acquire+0x88/0xac
[<000000000024c9bc>] __mutex_lock_slowpath+0x130/0x31c
[<000000000024cbe6>] mutex_lock+0x3e/0x4c
[<0000000000159930>] blkdev_ioctl+0x64c/0x7f0
[<00000000000b55ec>] ioctl_by_bdev+0x74/0xbc
[<000000008085c686>] dasd_destroy_partitions+0x66/0xc0 [dasd_mod]
[<0000000080855c1a>] dasd_change_state+0x322/0x664 [dasd_mod]
[<0000000080855fe2>] dasd_set_target_state+0x86/0x94 [dasd_mod]
[<00000000808595d0>] dasd_ioctl+0x124/0x6c4 [dasd_mod]
[<000000000015926e>] blkdev_driver_ioctl+0x8e/0xa8
[<0000000000159a6e>] blkdev_ioctl+0x78a/0x7f0
[<00000000000b5676>] block_ioctl+0x42/0x54
[<00000000000c3da6>] do_ioctl+0x56/0xb8
[<00000000000c413a>] vfs_ioctl+0x332/0x350
[<00000000000c41d2>] sys_ioctl+0x7a/0xa0
[<000000000001fcec>] sysc_tracego+0xe/0x14
[<0000004f3b2f2c82>] 0x4f3b2f2c82
The lspp.76 kernel was built in an attempt to solve this problem; it increases the NMI watchdog timer value. Please give it a try and let us know if the problem is solved or still exists. Thanks.
Is this kernel built with debugging symbols? Its size decreased from 67MB to just 16MB.
I can't reproduce this anymore on my JS21. As Steve requested, these are the 'time' values for load_policy:

===> from strict to mls:
[root/user_r/@js21racer1 ~]# time load_policy

real 0m1.666s
user 0m0.146s
sys 0m1.356s

===> from mls to strict:
[root/user_r/@js21racer1 ~]# time load_policy
libsepol.sepol_genbools_array: boolean allow_httpd_auditadm_script_anon_write no longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_secadm_script_anon_write no longer in policy
libsepol.sepol_genbools_array: boolean allow_ipsec_label no longer in policy
libsepol.sepol_genbools_array: boolean allow_netlabel no longer in policy

real 0m1.430s
user 0m0.160s
sys 0m1.263s

===> from strict to targeted:
[root/user_r/@js21racer1 ~]# time load_policy
libsepol.sepol_genbools_array: boolean allow_gpg_execstack no longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_staff_script_anon_write no longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_sysadm_script_anon_write no longer in policy
libsepol.sepol_genbools_array: boolean allow_httpd_user_script_anon_write no longer in policy
libsepol.sepol_genbools_array: boolean allow_mplayer_execstack no longer in policy
libsepol.sepol_genbools_array: boolean allow_ptrace no longer in policy
libsepol.sepol_genbools_array: boolean allow_ssh_keysign no longer in policy
libsepol.sepol_genbools_array: boolean allow_user_mysql_connect no longer in policy
libsepol.sepol_genbools_array: boolean allow_write_xshm no longer in policy
libsepol.sepol_genbools_array: boolean cdrecord_read_content no longer in policy
libsepol.sepol_genbools_array: boolean cron_can_relabel no longer in policy
libsepol.sepol_genbools_array: boolean disable_evolution_trans no longer in policy
libsepol.sepol_genbools_array: boolean disable_games_trans no longer in policy
libsepol.sepol_genbools_array: boolean disable_mozilla_trans no longer in policy
libsepol.sepol_genbools_array: boolean disable_thunderbird_trans no longer in policy
libsepol.sepol_genbools_array: boolean mail_read_content no longer in policy
libsepol.sepol_genbools_array: boolean mozilla_read_content no longer in policy
libsepol.sepol_genbools_array: boolean pppd_for_user no longer in policy
libsepol.sepol_genbools_array: boolean read_untrusted_content no longer in policy
libsepol.sepol_genbools_array: boolean run_ssh_inetd no longer in policy
libsepol.sepol_genbools_array: boolean secure_mode no longer in policy
libsepol.sepol_genbools_array: boolean spamassassin_can_network no longer in policy
libsepol.sepol_genbools_array: boolean ssh_sysadm_login no longer in policy
libsepol.sepol_genbools_array: boolean staff_read_sysadm_file no longer in policy
libsepol.sepol_genbools_array: boolean user_direct_mouse no longer in policy
libsepol.sepol_genbools_array: boolean user_dmesg no longer in policy
libsepol.sepol_genbools_array: boolean user_net_control no longer in policy
libsepol.sepol_genbools_array: boolean user_ping no longer in policy
libsepol.sepol_genbools_array: boolean user_rw_noexattrfile no longer in policy
libsepol.sepol_genbools_array: boolean user_tcp_server no longer in policy
libsepol.sepol_genbools_array: boolean user_ttyfile_stat no longer in policy
libsepol.sepol_genbools_array: boolean write_untrusted_content no longer in policy
libsepol.sepol_genbools_array: boolean xdm_sysadm_login no longer in policy

real 0m1.496s
user 0m0.078s
sys 0m1.268s

(No softlockups seen on the console or in the messages log. This machine is a ppc64 2xPPC970MP dual-core.)
Loulwa, please retest this and comment.
It shouldn't be occurring anymore, since the time limit on the NMI watchdog firing is now consistent with RHEL3/4 behavior, at 30 seconds. I spent some time looking at this on Friday; please do CC me on any non-BZ discussion around this bug. Note that in a future RHEL5 update, the NMI watchdog will likely default to 30 seconds, and may in fact default to disabled on some platforms. Comments and input on that kind of fix are, of course, very welcome. Jon.
Sorry, I was out last week and the patches for this got a little off track this week. Kernels .74 and .75 had a patch which should have caused these messages not to appear. That patch was pulled in .76, and a patch to lengthen the NMI watchdog timeout was added. Although that is a good patch to have (for some hardware, Core 2 Duo systems in particular), the NMI timer isn't the problem here: this soft lockup check is triggered by the normal timer interrupt, inside softlockup_tick(). So there is nothing in .76 which should change the soft lockup detection message discussed in this bug.

I'm going to take a look this week to see why sel_write_load takes 10+ seconds, but just making the messages go away might be the way we go in the LSPP kernel. Locking changes as discussed by sds might be the right way to go upstream, but for the LSPP kernel something a bit quicker (and safer) will probably be the right thing to do.

All information anyone has on how to reproduce this would be a lot of help. I've heard it only happens on 8-way systems, and the test case is in comment #26. Possibly what is included in the policy module in question could be useful. So testing to see if it can be hit on .74/.75, and if not, whether it comes back on .76, could be interesting, but I'm not sure it will help me track down the problem.
I am running the latest kernel, .76, with the latest packages and policy, and I still see the soft lockup. Again I am running the same test case, which basically does a "semodule -R/u/i/r" as different users.

[root/sysadm_r/SystemLow@joy-hv4 ~]# BUG: soft lockup detected on CPU#5!
Call Trace:
[C000000046312F60] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000046313000] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C0000000463130A0] [C00000000007339C] .run_local_timers+0x1c/0x30
[C000000046313120] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000046313200] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kfree+0x2b0/0x2d8    LR = .kfree+0xe0/0x2d8
[C0000000463135A0] [C000000000183D08] .cond_av_list_destroy+0x20/0x44
[C000000046313630] [C000000000183D70] .cond_node_destroy+0x44/0x70
[C0000000463136C0] [C000000000183DBC] .cond_list_destroy+0x20/0x40
[C000000046313750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000463137F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000046313C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000046313CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000046313D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000046313E30] [C00000000000869C] syscall_exit+0x0/0x40
BUG: soft lockup detected on CPU#3!
Call Trace:
[C000000002316FF0] [C0000000000109C0] .show_stack+0x68/0x1b0 (unreliable)
[C000000002317090] [C0000000000A1820] .softlockup_tick+0xec/0x124
[C000000002317130] [C00000000007339C] .run_local_timers+0x1c/0x30
[C0000000023171B0] [C000000000023CF8] .timer_interrupt+0xa8/0x464
[C000000002317290] [C0000000000034EC] decrementer_common+0xec/0x100
--- Exception: 901 at .kmem_cache_free+0x2ac/0x2d4    LR = .kmem_cache_free+0xdc/0x2d4
[C000000002317630] [C00000000017BF30] .avtab_destroy+0x58/0xb0
[C0000000023176C0] [C0000000001844CC] .cond_policydb_destroy+0x30/0x4c
[C000000002317750] [C00000000017D2B4] .policydb_destroy+0x144/0x280
[C0000000023177F0] [C000000000182858] .security_load_policy+0x278/0x31c
[C000000002317C10] [C000000000177A50] .sel_write_load+0xd0/0x340
[C000000002317CF0] [C0000000000E2410] .vfs_write+0x118/0x200
[C000000002317D90] [C0000000000E2B80] .sys_write+0x4c/0x8c
[C000000002317E30] [C00000000000869C] syscall_exit+0x0/0x40

[root/sysadm_r/SystemLow@joy-hv4 ~]# uname -a
Linux joy-hv4.ltc.austin.ibm.com 2.6.18-8.1.1.lspp.76.el5 #1 SMP Fri Apr 13 16:12:34 EDT 2007 ppc64 ppc64 ppc64 GNU/Linux

Please let me know what else I can do to help. I have tried this on the .74 kernel and saw the soft lockup as well (as in comment #30), but I did not try the .75 kernel.
No surprise that .76 still shows the problem; there are no patches in .76 which address this issue. Comment #30 is interesting (thanks for pointing it out), as it greatly shrinks the area of code where the delay is taking place. I'm going to look for an 8-way machine and start code diving to see what might take so much time.
Got some hardware which might be close to yours; going to start playing with this myself. How much memory does your system have?

With the knowledge in comment #30 we can see that the 'problem' is in policydb_destroy(), which does little other than walk lists and kfree memory. I'm sure the problem is made worse by the fact that we have slab debugging on, making kfree more expensive.

Would it be possible to get some oprofile data from you on this? I'd prefer you run the .75 kernel (not .76), since it has the softlockup-watchdog touch right before policydb_destroy(). It might be a little harder to trigger but will be 'better.' I dropped the kernel and debuginfo on my people page: http://people.redhat.com/eparis/LSPP-PPC-Debuginfo/

It should be as simple as installing the kernel and debuginfos, then:

opcontrol --setup --vmlinux=/usr/lib/debug/lib/modules/2.6.18-8.el5/vmlinux
opcontrol --start
$DO_TEST
opcontrol --stop
opreport -l /usr/lib/debug/lib/modules/2.6.18-8.el5/vmlinux > $OUTPUT_FILE

and then give me the contents of the output file. Thanks, Eric
Created attachment 152964 [details] Profiler run with .75 kernel
Comment on attachment 152964 [details] Profiler run with .75 kernel I installed the lspp.75 kernel as requested above and ran my test case twice without getting the soft lockup. However, I wanted to run the test case with the profiler running anyway, and that's when I saw the soft lockup. Attached is the profiler output.
I assume the soft lockup was in policydb_destroy again?
Yes, in policydb_destroy. The lockup is exactly the same as in the previous posts, so I didn't want to post it again and again.
After some looking, and some private discussions with others, I think I'm not going to do anything with this bug. The top two functions by time (by FAR) in your test were:

121202 23.9879 .check_poison_obj
 78247 15.4864 .check_slabp

accounting for 39% of the time. Since the final LSPP kernel won't have debug functions like these at all, I see this as a non-issue. Let's leave this open until the 'final' build comes along without any debug options. Long term I'm going to put it on the list to look at the way we allocate/free memory for the policydb, but I don't see any reason at all why the issue under discussion here is problematic or troubling from either a functionality or certification point of view.
lspp.77 kernel was built with a new patch addressing this issue. Can someone check it and see if the problem has been solved? Thanks.
It was reported on IRC that lspp.77 fixed the problem. Removing from tracker.
in 2.6.18-27.el5 You can download this test kernel from http://people.redhat.com/dzickus/el5
(In reply to comment #54) > in 2.6.18-27.el5 > You can download this test kernel from http://people.redhat.com/dzickus/el5 Sorry, I lost track of what this fix is for. The issue was originally reported for a LSPP kernel. Is this fix specific for ppc64 or more arches? Is the testcase (unload/load policy) still valid? Thanks, Klaus K.
The fix was general, not ppc-specific. It appeared to be reproducible only on a specific PPC machine during one particular IBM test suite, which was loading and unloading an empty policy module. I was never actually able to trigger the issue on a similar machine here inside Red Hat. It's going to be a hard one to say 'it is fixed,' since you can't tell whether the 'fix' in that situation came from the removal of the debug options or from the addition of voluntary preemption points.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot3 on partners.redhat.com. Requested action: Please verify that your issue is fixed as soon as possible to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. More assistance: If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot4 on partners.redhat.com. Requested action: Please verify that your issue is fixed *as soon as possible* to ensure that it is included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
A fix for this issue should have been included in the packages contained in the RHEL5.1-Snapshot6 on partners.redhat.com. Requested action: Please verify that your issue is fixed ASAP to confirm that it will be included in this update release. After you (Red Hat Partner) have verified that this issue has been addressed, please perform the following: 1) Change the *status* of this bug to VERIFIED. 2) Add *keyword* of PartnerVerified (leaving the existing keywords unmodified) If this issue is not fixed, please add a comment describing the most recent symptoms of the problem you are having and change the status of the bug to FAILS_QA. If you cannot access bugzilla, please reply with a message to Issue Tracker and I will change the status for you. If you need assistance accessing ftp://partners.redhat.com, please contact your Partner Manager.
An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you. http://rhn.redhat.com/errata/RHBA-2007-0959.html