Bug 436453

Summary:	F-9 pv_ops xen: dlclose() oops with prelinked libraries on x86_32
Product:	[Fedora] Fedora	Reporter:	Michael Young <m.a.young>
Component:	kernel-xen-2.6	Assignee:	Mark McLoughlin <markmc>
Status:	CLOSED RAWHIDE	QA Contact:	Fedora Extras Quality Assurance <extras-qa>
Severity:	low	Docs Contact:
Priority:	low
Version:	rawhide	CC:	berrange, ehabkost, xen-maint
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-xen-2.6-2.6.25-0.12.rc7.git6.fc9	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2008-04-01 14:01:50 UTC	Type:	---
Bug Blocks:	434756

Description Michael Young 2008-03-07 10:51:18 UTC

I have been testing kernel-xen-2.6-2.6.25-0.0.rc4.fc9 from koji. Aside from it
taking several attempts to boot (it doesn't always see the disks, which might or
might not be related to bug 434760), I get the following crash while running yum
update, during the install packages stage:

kernel BUG at arch/x86/xen/enlighten.c:708!
invalid opcode: 0000 [#1] SMP
Modules linked in: nfs lockd nfs_acl rfcomm l2cap bluetooth autofs4 sunrpc ipv6
dm_mirror dm_multipath dm_mod xen_netfront pcspkr xen_blkfront ext3 jbd mbcache
uhci_hcd ohci_hcd ehci_hcd

Pid: 26913, comm: crond Not tainted (2.6.25-0.0.rc4.fc9xen #1)
EIP: 0061:[<c040342f>] EFLAGS: 00010282 CPU: 0
EIP is at xen_release_pt+0x79/0xa9
EAX: ffffffea EBX: c9c34ea0 ECX: 00000001 EDX: 00000000
ESI: 00007ff0 EDI: 0000883e EBP: c9c34eb8 ESP: c9c34ea0
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0069
Process crond (pid: 26913, ti=c9c34000 task=cf586000 task.ti=c9c34000)
Stack: 00000004 0000a411 00000000 c1458250 0000883e 0883e000 c9c34ecc c041ca01
       00000000 00000000 0480f000 c9c34f1c c0479146 089cb067 c10d1000 80000000
       4a999fff 4a99a000 c9c34f5c c480f008 4a99a000 cf970d04 c8847500 cf970d04
Call Trace:
 [<c041ca01>] ? __pmd_free_tlb+0x1a/0x75
 [<c0479146>] ? free_pgd_range+0x1d2/0x2b5
 [<c04792a7>] ? free_pgtables+0x7e/0x93
 [<c047a209>] ? unmap_region+0xb9/0xf5
 [<c047af89>] ? do_munmap+0x193/0x1f5
 [<c047b01b>] ? sys_munmap+0x30/0x3f
 [<c0408bda>] ? syscall_call+0x7/0xb
 =======================
Code: 3e eb 50 a1 f0 fd 84 c0 8b 04 b8 25 ff ff ff 7f 89 45 ec 8d 5d e8 b9 01 00
00 00 31 d2 be f0 7f 00 00 e8 15 9f 40 00 85 c0 74 04 <0f> 0b eb fe c1 e7 0c 8d
87 00 00 00 c0 e8 0a 1b 00 00 eb 14 80
EIP: [<c040342f>] xen_release_pt+0x79/0xa9 SS:ESP 0069:c9c34ea0
---[ end trace e148c210d8c026ec ]---

The error has repeated itself a few times, seemingly freezing the yum process,
but commands can still be run from a different terminal session.

Comment 1 Mark McLoughlin 2008-03-07 14:59:17 UTC

Michael: thanks

I haven't seen your "doesn't always see the disks" issue before, so please do
log a bug on that

Adding to PvOpsTracker ..

Comment 2 Michael Young 2008-03-07 16:07:52 UTC

I have logged the disk issue as bug 436493. For the current bug, I am guessing
that yum isn't directly involved; it just produces the sort of conditions that
let the problem show itself. So far it has only occurred during one yum run,
though the crash was repeated a few times. I might get a better idea of how
reproducible it is when the next batch of updates appears.

Comment 3 Mark McLoughlin 2008-03-25 16:08:47 UTC

Confirmed; just saw this myself during a yum update of an x86_32 guest

Comment 4 Mark McLoughlin 2008-03-26 17:09:46 UTC

Okay, I now have a reliable reproducer with a fully up to date guest.

Just running:

  /usr/lib/nss/unsupported-tools/shlibsign -i /lib/libsoftokn3.so 

seems to trigger it for me

Both armbru and I also saw crond trigger it when running cron.hourly

Comment 5 Mark McLoughlin 2008-03-26 19:41:24 UTC

Further narrowed it down to any dlopen() on a library whose first segment is to
be loaded at a reasonably high virtual address

i.e. running this:

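#include <dlfcn.h>

int main(int argc, char **argv)
{
  /* Open the library given as argv[1], then close it again at once. */
  void *handle = (argc > 1) ? dlopen(argv[1], RTLD_LAZY) : NULL;

  /* Returns 1 if no library was given or it could not be opened. */
  return handle ? dlclose(handle) : 1;
}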

against any of the libraries returned by:

for lib in /lib/lib*.so.* /usr/lib/lib*.so.*; do
  [ -L "$lib" ] && continue
  v=$(eu-readelf -l "$lib" | awk '/LOAD/ { if (match($3, "[^0x]")) { print $3 } exit }')
  [ "$v" ] && echo "$lib" "$v"
done | sort -n
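
For anyone without eu-readelf to hand, the same check (is the virtual address
of the first LOAD segment non-zero?) can be sketched directly in C. This is
only an illustration, not something from the original investigation:
first_load_vaddr() is a hypothetical helper name, and the sketch assumes a
Linux <elf.h> and a file in the host's byte order.

```c
#include <elf.h>
#include <stdio.h>
#include <string.h>

/*
 * first_load_vaddr() is a hypothetical helper, not part of the original
 * report. It stores the p_vaddr of the first PT_LOAD program header of
 * the ELF file at `path` in *vaddr and returns 0, or returns -1 on error.
 * Assumes the file uses the host's byte order.
 */
static int first_load_vaddr(const char *path, unsigned long long *vaddr)
{
  unsigned char ident[EI_NIDENT];
  unsigned i;
  int ret = -1;
  FILE *f = fopen(path, "rb");

  if (!f)
    return -1;
  /* Read e_ident first to check the magic and learn the ELF class. */
  if (fread(ident, 1, sizeof(ident), f) != sizeof(ident) ||
      memcmp(ident, ELFMAG, SELFMAG) != 0)
    goto out;
  rewind(f);

  if (ident[EI_CLASS] == ELFCLASS32) {
    Elf32_Ehdr eh;
    Elf32_Phdr ph;

    if (fread(&eh, sizeof(eh), 1, f) != 1)
      goto out;
    /* Walk the program header table until the first PT_LOAD entry. */
    for (i = 0; i < eh.e_phnum; i++) {
      long off = (long)(eh.e_phoff + (unsigned long)i * eh.e_phentsize);

      if (fseek(f, off, SEEK_SET) != 0 || fread(&ph, sizeof(ph), 1, f) != 1)
        goto out;
      if (ph.p_type == PT_LOAD) {
        *vaddr = ph.p_vaddr;
        ret = 0;
        break;
      }
    }
  } else if (ident[EI_CLASS] == ELFCLASS64) {
    Elf64_Ehdr eh;
    Elf64_Phdr ph;

    if (fread(&eh, sizeof(eh), 1, f) != 1)
      goto out;
    for (i = 0; i < eh.e_phnum; i++) {
      long off = (long)(eh.e_phoff + (unsigned long)i * eh.e_phentsize);

      if (fseek(f, off, SEEK_SET) != 0 || fread(&ph, sizeof(ph), 1, f) != 1)
        goto out;
      if (ph.p_type == PT_LOAD) {
        *vaddr = ph.p_vaddr;
        ret = 0;
        break;
      }
    }
  }
out:
  fclose(f);
  return ret;
}
```

A caller would pass a library path and an unsigned long long to receive the
address; a non-zero result would correspond to a library that the shell loop
above prints.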

Comment 6 Mark McLoughlin 2008-03-26 23:54:24 UTC

Further notes:

  - This is perfectly reproducible on stock 2.6.25-rc6 pv_ops xen; so
    we can rule out the x86_64 xen patches as the cause

  - prelink is what's causing these libs to have a non-zero base load
    address; if I "prelink -u" a lib first, then it can be dlopen()ed
    without a problem

  - the oops occurs during dlclose() when we try to munmap() the base
    load address

  - I've only reproduced this on x86_32 so far, but I can't seem to get
    prelink to relocate libs on x86_64, so perhaps it is really an issue
    there too

Comment 7 Mark McLoughlin 2008-03-28 16:32:23 UTC

An even simpler test case:

--
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

#define MMAP_ADDR_GOOD (void *)0x3ffff000
#define MMAP_ADDR_BAD  (void *)0x40000000
#define MMAP_LEN 0x1000

int
main(int argc, char **argv)
{
  int fd = open("/dev/zero", O_RDONLY);
  munmap(mmap(MMAP_ADDR_GOOD, MMAP_LEN, PROT_READ, MAP_PRIVATE, fd, 0), MMAP_LEN);
  printf("Mapping to %p succeeded\n", MMAP_ADDR_GOOD);
  munmap(mmap(MMAP_ADDR_BAD, MMAP_LEN, PROT_READ, MAP_PRIVATE, fd, 0), MMAP_LEN);
  printf("Mapping to %p succeeded\n", MMAP_ADDR_BAD);
  close(fd);
  return 0;
}
--

The BUG occurs on the second munmap()
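
Note that the test case above passes the addresses without MAP_FIXED, so they
are only hints, and it prints "succeeded" without checking any return value.
A more defensive variant of the same map-then-unmap step might look like the
sketch below. try_map_unmap() is a hypothetical helper name introduced here
for illustration; the mapping parameters are the ones from the test case.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * try_map_unmap() is a hypothetical helper, not part of the original test
 * case. It maps `len` bytes of /dev/zero read-only and private at the
 * hinted address, then unmaps them again. Returns the address the kernel
 * actually chose, or MAP_FAILED if open(), mmap() or munmap() failed.
 */
static void *try_map_unmap(void *hint, size_t len)
{
  void *addr;
  int fd = open("/dev/zero", O_RDONLY);

  if (fd < 0) {
    perror("open");
    return MAP_FAILED;
  }

  /* Without MAP_FIXED the kernel treats the address only as a hint. */
  addr = mmap(hint, len, PROT_READ, MAP_PRIVATE, fd, 0);
  close(fd);  /* the mapping stays valid after the descriptor is closed */
  if (addr == MAP_FAILED) {
    perror("mmap");
    return MAP_FAILED;
  }
  if (addr != hint)
    fprintf(stderr, "hint %p not honoured, got %p\n", hint, addr);

  if (munmap(addr, len) != 0) {
    perror("munmap");
    return MAP_FAILED;
  }
  return addr;
}
```

Calling try_map_unmap((void *)0x40000000, 0x1000) on an affected guest would
be expected to hit the BUG in the munmap() step, provided the returned address
matches the hint; the warning on stderr shows when it does not.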

Comment 8 Mark McLoughlin 2008-03-28 16:48:41 UTC

Posted a fix for this to lkml, see:

  http://lkml.org/lkml/2008/3/28/286

Awaiting some upstream feedback before building in rawhide.

Comment 9 Mark McLoughlin 2008-04-01 14:01:50 UTC

Should be fixed now in kernel-xen-2.6-2.6.25-0.12.rc7.git6.fc9