I have been testing kernel-xen-2.6-2.6.25-0.0.rc4.fc9 from koji. Aside from it taking several attempts to get it to boot (it doesn't always see the disks, which might or might not be related to bug 434760), I get the following crash while running yum update, during the install packages stage:

kernel BUG at arch/x86/xen/enlighten.c:708!
invalid opcode: 0000 [#1] SMP
Modules linked in: nfs lockd nfs_acl rfcomm l2cap bluetooth autofs4 sunrpc ipv6 dm_mirror dm_multipath dm_mod xen_netfront pcspkr xen_blkfront ext3 jbd mbcache uhci_hcd ohci_hcd ehci_hcd

Pid: 26913, comm: crond Not tainted (2.6.25-0.0.rc4.fc9xen #1)
EIP: 0061:[<c040342f>] EFLAGS: 00010282 CPU: 0
EIP is at xen_release_pt+0x79/0xa9
EAX: ffffffea EBX: c9c34ea0 ECX: 00000001 EDX: 00000000
ESI: 00007ff0 EDI: 0000883e EBP: c9c34eb8 ESP: c9c34ea0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process crond (pid: 26913, ti=c9c34000 task=cf586000 task.ti=c9c34000)
Stack: 00000004 0000a411 00000000 c1458250 0000883e 0883e000 c9c34ecc c041ca01
       00000000 00000000 0480f000 c9c34f1c c0479146 089cb067 c10d1000 80000000
       4a999fff 4a99a000 c9c34f5c c480f008 4a99a000 cf970d04 c8847500 cf970d04
Call Trace:
 [<c041ca01>] ? __pmd_free_tlb+0x1a/0x75
 [<c0479146>] ? free_pgd_range+0x1d2/0x2b5
 [<c04792a7>] ? free_pgtables+0x7e/0x93
 [<c047a209>] ? unmap_region+0xb9/0xf5
 [<c047af89>] ? do_munmap+0x193/0x1f5
 [<c047b01b>] ? sys_munmap+0x30/0x3f
 [<c0408bda>] ? syscall_call+0x7/0xb
 =======================
Code: 3e eb 50 a1 f0 fd 84 c0 8b 04 b8 25 ff ff ff 7f 89 45 ec 8d 5d e8 b9 01 00 00 00 31 d2 be f0 7f 00 00 e8 15 9f 40 00 85 c0 74 04 <0f> 0b eb fe c1 e7 0c 8d 87 00 00 00 c0 e8 0a 1b 00 00 eb 14 80
EIP: [<c040342f>] xen_release_pt+0x79/0xa9 SS:ESP 0069:c9c34ea0
---[ end trace e148c210d8c026ec ]---

The error has repeated itself a few times, seemingly freezing the yum process, but commands can still be run from a different terminal session.
Michael: thanks. I haven't seen your "doesn't always see the disks" issue before, so please do log a bug on that. Adding to PvOpsTracker.
I have logged the disk issue as bug 436493. I am guessing that yum isn't directly involved in the current bug; it just produces the sort of conditions that let the problem show itself. So far it has only occurred during one yum run, though the crash repeated a few times during that run. I might get a better idea of how reproducible it is when the next batch of updates appears.
Confirmed; just saw this myself during a yum update of an x86_32 guest
Okay, I now have a reliable reproducer with a fully up-to-date guest. Just running:

  /usr/lib/nss/unsupported-tools/shlibsign -i /lib/libsoftokn3.so

seems to trigger it for me.

Both armbru and I also saw crond trigger it when running cron.hourly.
Further narrowed it down to any dlopen() on a library whose first segment is to be loaded at a reasonably high virtual address, i.e. running this:

  #include <dlfcn.h>

  int main(int argc, char **argv)
  {
      return dlclose(dlopen(argv[1], RTLD_LAZY));
  }

against any of the libraries returned by:

  for iii in /lib/lib*.so.* /usr/lib/lib*.so.*; do
      [ -L $iii ] && continue
      v=$(eu-readelf -l $iii | awk '/LOAD/ { if (match($3, "[^0x]")) {print $3} exit}')
      [ "$v" ] && echo $iii $v
  done | sort -n
Further notes:

 - This is perfectly reproducible on stock 2.6.25-rc6 pv_ops xen, so we can rule out the x86_64 xen patches as the cause
 - prelink is what's causing these libs to have a non-zero base load address; if I "prelink -u" a lib first, then it can be dlopen()ed without a problem
 - the oops occurs during dlclose() when we try to munmap() the base load address
 - I've only reproduced this on x86_32 so far, but I can't seem to get prelink to relocate libs on x86_64, so perhaps it is really an issue there too
An even simpler test case:

--
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define MMAP_ADDR_GOOD (void *)0x3ffff000
#define MMAP_ADDR_BAD  (void *)0x40000000
#define MMAP_LEN       0x1000

int main(int argc, char **argv)
{
    int fd = open("/dev/zero", O_RDONLY);

    munmap(mmap(MMAP_ADDR_GOOD, MMAP_LEN, PROT_READ, MAP_PRIVATE, fd, 0), MMAP_LEN);
    printf("Mapping to %p succeeded\n", MMAP_ADDR_GOOD);

    munmap(mmap(MMAP_ADDR_BAD, MMAP_LEN, PROT_READ, MAP_PRIVATE, fd, 0), MMAP_LEN);
    printf("Mapping to %p succeeded\n", MMAP_ADDR_BAD);

    close(fd);
    return 0;
}
--

The BUG occurs on the second munmap()
Posted a fix for this to lkml, see:

  http://lkml.org/lkml/2008/3/28/286

Awaiting some upstream feedback before building in rawhide.
Should be fixed now in kernel-xen-2.6-2.6.25-0.12.rc7.git6.fc9