Description of problem:
We have a system running 6 virtual servers under Xen on x86_64 architecture (HP ProLiant 385 server with 2 dual core Opteron processors). Recently we've decided to start using SELinux on guest machines. On most of them we've prepared custom policy binary modules (based on avc entries from kernel log) and loaded them into kernel with semodule tool. On two of them, we switched SELinux to enforcing mode (one has the custom module loaded, one is running vanilla policy). After those changes, the servers started crashing, all at the same time (together with the zero domain). We've been able to capture a kernel stack trace once:

Kernel BUG at drivers/xen/netfront/netfront.c:663
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in: ipv6 xennet ip_conntrack_netbios_ns ipt_REJECT ipt_LOG ipt_recent xt_state ip_conntrack nfnetlink xt_tcpudp iptable_filter ip_tables x_tables dm_mirror dm_mod
Pid: 8, comm: xenwatch Not tainted 2.6.17-1.2187_FC5xenU #1
RIP: e030:[<ffffffff88055b57>] <ffffffff88055b57>{:xennet:network_alloc_rx_buffers+507}
RSP: e02b:ffff880000573df8 EFLAGS: 00010086
RAX: 0000000000000000 RBX: ffff880000eb6980 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000001040
RBP: ffff880008da0580 R08: 0000000000000010 R09: 00000000000000ac
R10: ffff880008da06c8 R11: 0000000000000000 R12: 0000000000000208
R13: ffff880008da06c8 R14: 00000000000000ff R15: ffff880008da5a50
FS: 00002aaaaaab78b0(0000) GS:ffffffff80524000(0000) knlGS:0000000000000000
CS: e033 DS: 0000 ES: 0000
Process xenwatch (pid: 8, threadinfo ffff880000572000, task ffff88000056c100)
Stack: ffff880000573e18 ffff880008da0000 0000010000000001 ffffffff8026b51c
       00000000000000cc 000000348024d92e ffffffff8036db0a 000000000056c100
       000000000000000f ffff880008da0000
Call Trace:
 <ffffffff8026b51c>{_spin_unlock_irqrestore+9}
 <ffffffff8036db0a>{xenwatch_thread+169}
 <ffffffff88056130>{:xennet:backend_changed+668}
 <ffffffff8036da61>{xenwatch_thread+0}
 <ffffffff802975d1>{keventd_create_kthread+0}
 <ffffffff8036d251>{xenwatch_handle_callback+21}
 <ffffffff8036dbed>{xenwatch_thread+396}
 <ffffffff802977d4>{autoremove_wake_function+0}
 <ffffffff802975d1>{keventd_create_kthread+0}
 <ffffffff80237e0d>{kthread+212}
 <ffffffff802670ae>{child_rip+8}
 <ffffffff802975d1>{keventd_create_kthread+0}
 <ffffffff80237d39>{kthread+0}
 <ffffffff802670a6>{child_rip+0}

Code: 0f 0b 68 88 7c 05 88 c2 97 02 8b 85 f0 00 00 00 4c 63 f2 48
RIP <ffffffff88055b57>{:xennet:network_alloc_rx_buffers+507} RSP <ffff880000573df8>
<3>BUG: sleeping function called from invalid context at include/linux/rwsem.h:43
in_atomic():0, irqs_disabled():1

Call Trace:
 <ffffffff80291b41>{blocking_notifier_call_chain+31}
 <ffffffff80217f42>{do_exit+32}
 <ffffffff802706bd>{kernel_math_error+0}
 <ffffffff80270fcc>{do_invalid_op+163}
 <ffffffff88055b57>{:xennet:network_alloc_rx_buffers+507}
 <ffffffff80210cf6>{__alloc_pages+271}
 <ffffffff80266daf>{error_exit+0}
 <ffffffff88055b57>{:xennet:network_alloc_rx_buffers+507}
 <ffffffff88055be4>{:xennet:network_alloc_rx_buffers+648}
 <ffffffff8026b51c>{_spin_unlock_irqrestore+9}
 <ffffffff8036db0a>{xenwatch_thread+169}
 <ffffffff88056130>{:xennet:backend_changed+668}
 <ffffffff8036da61>{xenwatch_thread+0}
 <ffffffff802975d1>{keventd_create_kthread+0}
 <ffffffff8036d251>{xenwatch_handle_callback+21}
 <ffffffff8036dbed>{xenwatch_thread+396}
 <ffffffff802977d4>{autoremove_wake_function+0}
 <ffffffff802975d1>{keventd_create_kthread+0}
 <ffffffff80237e0d>{kthread+212}
 <ffffffff802670ae>{child_rip+8}
 <ffffffff802975d1>{keventd_create_kthread+0}
 <ffffffff80237d39>{kthread+0}
 <ffffffff802670a6>{child_rip+0}

BUG: xenwatch/8, lock held at task exit time!
 [ffffffff80474400] {xenwatch_mutex}
.. held by: xenwatch: 8 [ffff88000056c100, 110]
... acquired at: xenwatch_thread+0xa9/0x1a5

Version-Release number of selected component (if applicable):
2.6.17-1.2187_FC5xen0

How reproducible:
Usually systems crash every 2/3 days.
It's difficult to imagine how SELinux might be able to cause this:

Kernel BUG at drivers/xen/netfront/netfront.c:663
invalid opcode: 0000 [1] SMP

Have you seen anything corresponding in the audit logs?
The audit logs on Domain0 show the following entries at the time of the crash:

time->Mon Oct 9 02:24:34 2006
type=PATH msg=audit(1160353474.279:1269): item=0 name="/var/xen/someguestdomain.img" inode=48037918 dev=68:03 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:var_t:s0
type=CWD msg=audit(1160353474.279:1269): cwd="/"
type=SYSCALL msg=audit(1160353474.279:1269): arch=c000003e syscall=21 success=yes exit=0 a0=5f45e0 a1=4 a2=1 a3=5f45e0 items=1 pid=1931 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) comm="python" exe="/usr/bin/python" subj=system_u:system_r:xend_t:s0
type=AVC msg=audit(1160353474.279:1269): avc: denied { read } for pid=1931 comm="python" name="someguestdomain.img" dev=cciss/c0d0p3 ino=48037918 scontext=system_u:system_r:xend_t:s0 tcontext=root:object_r:var_t:s0 tclass=file
----
time->Mon Oct 9 02:24:36 2006
type=SYSCALL msg=audit(1160353476.127:1270): arch=c000003e syscall=16 success=yes exit=0 a0=3 a1=89a3 a2=7ffffd471fb0 a3=0 items=0 pid=20816 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) comm="brctl" exe="/usr/sbin/brctl" subj=system_u:system_r:udev_t:s0-s0:c0.c255
type=ANOM_PROMISCUOUS msg=audit(1160353476.127:1270): dev=vif8.0 prom=0 old_prom=256 auid=4294967295
----
time->Mon Oct 9 02:24:50 2006
type=PATH msg=audit(1160353490.148:1271): item=0 name="/var/xen/someguestdomain.img" inode=48037918 dev=68:03 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=root:object_r:var_t:s0
type=CWD msg=audit(1160353490.148:1271): cwd="/"
type=SYSCALL msg=audit(1160353490.148:1271): arch=c000003e syscall=21 success=yes exit=0 a0=6972d0 a1=4 a2=1 a3=6972d0 items=1 pid=1931 auid=4294967295 uid=0 gid=0 euid=0 suid=0 fsuid=0 egid=0 sgid=0 fsgid=0 tty=(none) comm="python" exe="/usr/bin/python" subj=system_u:system_r:xend_t:s0
type=AVC msg=audit(1160353490.148:1271): avc: denied { read } for pid=1931 comm="python" name="someguestdomain.img" dev=cciss/c0d0p3 ino=48037918 scontext=system_u:system_r:xend_t:s0 tcontext=root:object_r:var_t:s0 tclass=file

Also, at the moment of the crash, Domain0 logged the following in the kernel log:

Oct 9 02:24:35 domzero kernel: xenbr0: port 8(vif8.0) entering disabled state
Oct 9 02:24:36 domzero kernel: device vif8.0 left promiscuous mode
Oct 9 02:24:36 domzero kernel: xenbr0: port 8(vif8.0) entering disabled state

Domain0 continued running in a broken state and kept logging (for example, netfilter messages), but the guest domains were dead after this event.

We always run SELinux in permissive mode on Domain0.
The file, someguestdomain.img, needs to be labeled as something that xend can read. If you need assistance with that, please advise. Just curious, why do you use permissive mode?
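To see which domain is being denied and which type the image file currently carries, the AVC record from the audit log can be split into its fields. The following is only a minimal sketch in Python, assuming the record layout quoted in this report; the function name parse_avc is hypothetical, and the sketch does not handle quoted values that contain spaces.

```python
import re

# Matches the "avc: denied { perms } for key=value ..." part of an AVC record.
AVC_RE = re.compile(
    r"avc:\s+denied\s+\{\s*(?P<perms>[^}]*?)\s*\}\s+for\s+(?P<fields>.*)"
)


def type_of(context):
    """Return the type field of a user:role:type:level security context."""
    return context.split(":")[2]


def parse_avc(record):
    """Hypothetical helper: extract the interesting fields of one AVC denial."""
    match = AVC_RE.search(record)
    if match is None:
        return None
    # Assumption: values contain no whitespace, as in the records shown here.
    fields = dict(
        pair.split("=", 1)
        for pair in match.group("fields").split()
        if "=" in pair
    )
    return {
        "perms": match.group("perms").split(),
        "comm": fields.get("comm", "").strip('"'),
        "name": fields.get("name", "").strip('"'),
        "source_type": type_of(fields["scontext"]),
        "target_type": type_of(fields["tcontext"]),
        "tclass": fields["tclass"],
    }


# Sample record copied from the Domain0 audit log in this report.
record = (
    "type=AVC msg=audit(1160353474.279:1269): avc: denied { read } for "
    'pid=1931 comm="python" name="someguestdomain.img" dev=cciss/c0d0p3 '
    "ino=48037918 scontext=system_u:system_r:xend_t:s0 "
    "tcontext=root:object_r:var_t:s0 tclass=file"
)

info = parse_avc(record)
print(
    "%s (%s) was denied %s on %s %s labeled %s"
    % (
        info["comm"],
        info["source_type"],
        ",".join(info["perms"]),
        info["tclass"],
        info["name"],
        info["target_type"],
    )
)
```

For the record above this identifies xend_t as the denied domain and var_t as the current label of the image. Which type xend is allowed to read depends on the loaded policy.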
This is a known problem with the 2.6.17 kernel in FC5. It should be fixed with the 2.6.18 kernel in FC5-testing (2189 or later). So please give that a try.
Created attachment 138238
var/log/messages excerpt from attempt to boot kernel-xen0.x86_64 2.6.18-1.2189.fc5

The attachment is from an attempt to boot kernel-xen0.x86_64 2.6.18-1.2189.fc5. The boot was exceedingly slow: after several minutes it had not started sshd or enabled console logon, and it did not respond to ctrl-alt-del, so we had to force a power off.

Like the original poster's, my server is a Compaq DL385, with a single Opteron 280 and 1GB memory. I had been running one xen guest with SELINUX=enforcing on the build 2122 kernel (xen0 and xenU) for months with no instability. After updating to 2187 and to xen-3.0.2-3.FC5 this weekend, I experienced multiple xenU crashes in a 24 hour period. In each case it was not possible to create the guest successfully until after rebooting the xen0 system.

The 2189 kernel did not result in a usable xen0 system. Changing the xenU selinux config to SELINUX=permissive and booting xen0 on 2187 has kept both xen0 and xenU operational for 17 hours now. Time will tell whether this definitively avoids the crash conditions.
SELINUX=permissive on xen0 and xenU is not sufficient to avoid xenU crashing. I did not make 18 hours.

Name                              ID Mem(MiB) VCPUs State  Time(s)
Domain-0                           0      712     2 r-----   754.3
Zombie-www                         2      256     1 ----cd    27.8
Created attachment 138242
Excerpt of xend.log from xenU crash on 2187 and selinux permissive
Created attachment 138274
Excerpt from our log

Same here, attaching xend.log excerpt. SELinux was set to permissive on all domains.
I have the following line in "xm list" output too:

Name                              ID Mem(MiB) VCPUs State  Time(s)
Domainname-zacheta                 8      512     1 ----cd   632.5
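The State column in these listings is a set of positional flags (r running, b blocked, p paused, s shutdown, c crashed, d dying), so "----cd" means the guest has crashed and is being torn down. As a rough sketch only, the Python below picks the crashed domains out of "xm list" output; the helper names are hypothetical, and it assumes domain names contain no spaces and that the output has the six columns shown in this report.

```python
# Meaning of each flag character in the xm list State column.
STATE_FLAGS = {
    "r": "running",
    "b": "blocked",
    "p": "paused",
    "s": "shutdown",
    "c": "crashed",
    "d": "dying",
}


def parse_xm_list(text):
    """Hypothetical helper: turn `xm list` output into a list of dicts."""
    domains = []
    # Skip the header row; every other row has exactly six columns.
    for line in text.strip().splitlines()[1:]:
        name, dom_id, mem, vcpus, state, seconds = line.split()
        domains.append(
            {
                "name": name,
                "id": int(dom_id),
                "mem_mib": int(mem),
                "vcpus": int(vcpus),
                # Keep only the flags that are set (anything but "-").
                "state": [STATE_FLAGS[ch] for ch in state if ch != "-"],
                "time_s": float(seconds),
            }
        )
    return domains


def crashed_domains(domains):
    """Return the names of all domains whose state includes the crashed flag."""
    return [d["name"] for d in domains if "crashed" in d["state"]]


# Sample rows copied from the listings posted in this report.
sample = """
Name ID Mem(MiB) VCPUs State Time(s)
Domain-0 0 712 2 r----- 754.3
Zombie-www 2 256 1 ----cd 27.8
Domainname-zacheta 8 512 1 ----cd 632.5
"""

domains = parse_xm_list(sample)
for name in crashed_domains(domains):
    print("crashed guest:", name)
```

A check like this could be run periodically on Domain0 to note when a guest enters the crashed state, which may help correlate crashes with the kernel and audit logs.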
Well, the 2187 kernel is known to contain broken xen networking code, which causes the crashes that you are all seeing. The 2189 kernel contains fixes for that problem. If you are having problems installing the 2189 kernel, please report them separately. Thanks.

*** This bug has been marked as a duplicate of 206630 ***
*** This bug has been marked as a duplicate of 199944 ***
The 2189 kernel is not compatible with any currently available FC5 xen userspace packages. Not sure how it can be considered to be a fix.
Please note that bug 209910 was closed not because it has been resolved, but because it is a duplicate of 199944, which is still open. If you're having problems using the 2189 kernel, please file bugs on that so that they can be resolved. Thanks.