Bug 523941 - kernel 2.6.31-1[24].fc12 doesn't boot in xen PV guest on 64b host
Summary: kernel 2.6.31-1[24].fc12 doesn't boot in xen PV guest on 64b host
Keywords:
Status: CLOSED CURRENTRELEASE
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 12
Hardware: All
OS: Linux
high
medium
Target Milestone: ---
Assignee: Andrew Jones
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Duplicates: 528053
Depends On:
Blocks:
 
Reported: 2009-09-17 10:02 UTC by Jiri Denemark
Modified: 2013-01-10 05:28 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
: 523949 (view as bug list)
Environment:
Last Closed: 2010-02-23 18:19:29 UTC
Type: ---
Embargoed:


Attachments
Kernel output with earlyprintk=xen (14.07 KB, text/plain)
2009-10-02 08:37 UTC, Jiri Denemark
boot hang on bisected commit (2.54 KB, application/octet-stream)
2009-10-14 14:45 UTC, Andrew Jones
fedora config for last bootable kernel (95.55 KB, application/octet-stream)
2009-10-22 07:54 UTC, Andrew Jones
current rawhide kernel config (101.53 KB, application/octet-stream)
2009-10-22 07:56 UTC, Andrew Jones
latest available fedora kernel config (104.07 KB, application/octet-stream)
2009-10-22 07:57 UTC, Andrew Jones

Description Jiri Denemark 2009-09-17 10:02:46 UTC
Description of problem:

When trying to install 32b Fedora as a PV guest under RHEL-5.4 Xen, the installation kernel hangs at the very beginning and doesn't even print a single line. However, 64b kernel works fine.

Version-Release number of selected component (if applicable):

Reproduced with 2.6.31-12 and 2.6.31-14

How reproducible:

100%

Steps to Reproduce:
1. virt-install --nographics --paravirt --name=f12-32 --ram=1500 --file=/var/lib/xen/images/virval/f12-32.img --file-size=4 --location=/mnt/download/fedora/linux/development/rawhide-20090916/i386/os
  
Actual results:

Nothing. Not a single line of message from the kernel.

Expected results:

Happily booting kernel.

Additional info:

Comment 1 Jiri Denemark 2009-09-17 13:47:58 UTC
Note that the kernel boots on a 32b dom0...

Comment 2 Chuck Ebbert 2009-09-27 11:18:54 UTC
Possibly the same as bug 525290, which has been fixed very recently. kernel-2.6.31.1-48 has the fix for that.

Comment 3 Jeremy Fitzhardinge 2009-09-28 19:50:47 UTC
No, bug 525290 only affects 64-bit kernels.

Comment 4 Chuck Ebbert 2009-09-30 00:02:08 UTC
We ended up disabling the stack protector for 2.6.30 i386 kernels in Fedora 11. I wonder if this bug is caused by having it enabled. The fix is supposedly in linux-2.6-xen-stack-protector-fix.patch in F-12 but maybe that's not enough?

Comment 5 Jeremy Fitzhardinge 2009-09-30 02:54:49 UTC
AFAIK the current set of patches should be OK for Xen+stackprotector+32b, but I guess there could be some other combination which fails.  Are there any more details about how it fails (Xen "xm dmesg" output, boot kernel with "earlyprintk=xen", etc)?

Comment 6 Jiri Denemark 2009-10-02 08:31:19 UTC
Today, I tried installing rawhide-20091001. Nothing shows up in xm dmesg when I try to boot the 2.6.31.1-56.fc12.i686 kernel. But the result with earlyprintk=xen is much better (thanks for that suggestion). I'll attach the output I got with that...

Comment 7 Jiri Denemark 2009-10-02 08:37:08 UTC
Created attachment 363434 [details]
Kernel output with earlyprintk=xen

Comment 8 Andrew Jones 2009-10-02 12:57:18 UTC
I took a few steps down the stack starting from the NULL dereference:

BUG: unable to handle kernel NULL pointer dereference at (null)
IP:(early)  [<c04f7cc3>] check_slab+0x2b/0xba

The null dereference comes from calling the macro PageSlab on the argument page at the beginning of check_slab

static int check_slab(struct kmem_cache *s, struct page *page)
{
        int maxobj;

        VM_BUG_ON(!irqs_disabled());

        if (!PageSlab(page)) {
                slab_err(s, page, "Not a valid slab page");
...

We see from the stack that we got here from __slab_alloc, but in this case it actually came through alloc_debug_processing


static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                          unsigned long addr, struct kmem_cache_cpu *c)
{
...
        if (unlikely(SLABDEBUG && PageSlubDebug(c->page)))
                goto debug;
...
debug:
        if (!alloc_debug_processing(s, c->page, object, addr))
                goto another_slab;
...
}

static int alloc_debug_processing(struct kmem_cache *s, struct page *page,
                                        void *object, unsigned long addr)
{
        if (!check_slab(s, page))
                goto bad;
...

None of this stack walking really points to anything specific, and it's confusing that we can have a null page in check_slab because we got as far as the call to alloc_debug_processing in __slab_alloc, and c->page is checked in __slab_alloc.  However, it would be interesting to see what happens if we try to boot with the config option CONFIG_SLUB_DEBUG turned off. Doing that would avoid this null-dereferencing path.

Comment 9 Jiri Denemark 2009-10-02 13:05:17 UTC
And can't that be caused by: "Thread overran stack, or stack corrupted"?

Comment 10 Andrew Jones 2009-10-02 14:00:54 UTC
In this case I don't think so. Since the parameter that is strange here, page==NULL, is passed in a register, %edx, I don't think the stack (overrun or not) would affect it. Besides, the stack and the registers look pretty sane.

Comment 11 Jeremy Fitzhardinge 2009-10-02 22:50:34 UTC
Which slab allocator is this? If it's SLUB, then the kasprintf could be the very first allocation that happens after slab_state is set to UP.

However, in __slab_alloc(), I don't see how it can get to the debug: label with c->page == NULL.  That should be dealt with at new_slab:, and if it fails to allocate a new page it will return NULL...

Comment 12 Andrew Jones 2009-10-05 12:12:11 UTC
Yeah, it doesn't make much sense to have the stack we have. That backtrace disappeared on me when I made my own build, though it still hangs early in the boot. early_printk gets varying amounts of output to the console before the hang, so it's hard to tell exactly where we stop processing.  All that said, since this is a reproducible regression from running an F11 guest, I think the problem should be bisectable.  I'm starting the bisecting exercise now.

Comment 13 Andrew Jones 2009-10-09 16:58:53 UTC
*** Bug 528053 has been marked as a duplicate of this bug. ***

Comment 14 Andrew Jones 2009-10-10 12:04:54 UTC
The status of the bisecting is that I can't seem to locally build a kernel that matches the koji builds at http://kojipkgs.fedoraproject.org/packages/kernel. There certainly aren't enough kernel builds available there to bisect with them alone. I need to build my own bzImages.

What I've been doing is this:

cvs co -r <some-kernel-rev> kernel
cd to appropriate dir kernel/F-11
make prep
cd to next appropriate dir kernel-.../linux-...
cp desired config file to .config
make nonint_oldconfig
make -j4 bzImage

I really don't know why this shouldn't work, but for example I know that the last bootable koji build is kernel-PAE-2.6.29.6-217.2.16.fc11.i686. So this is my base "good" rev.  Doing the procedure above using
kernel-2_6_29_6-217_2_16_fc11 for the rev and the config file from the working RPM I should have a base "good" bzImage.  However, that bzImage doesn't boot, and neither does any other image I've created. I haven't yet tried to test my build procedure by attempting to boot bare-metal or by building truly known-good kernels from F10, but I really don't know what could be wrong with the procedure anyway. I'm building on an i686 machine, and I've also tried building on the vm that I successfully booted with the koji build rev stated above.

Does anybody have the magic recipe to make a koji equivalent bzImage locally?

Comment 15 Andrew Jones 2009-10-10 14:01:49 UTC
I forgot to mention that in my grub.conf I stole the initrd from the working RPM, and I also created a symlink in /lib/modules to point to the working modules dir with the appropriate version for my bzImage, which in my case is 2.6.29.6-atj. The boot doesn't get far enough to use either of those, though.

I also forgot to mention that I've tried builds from the upstream linux-2.6.29.y git tree as well. I got the same non-bootable results.

Doing my own 'make scratch-build-i686' using koji creates working rpms, but it's super slow. I guess good things come to those who wait...

Comment 16 Mike McGrath 2009-10-10 16:28:30 UTC
(In reply to comment #14)
> Does anybody have the magic recipe to make a koji equivalent bzImage locally?  

You're doing a mock build locally and it's still not working as koji works?  

Would a koji scratch build do what you're looking to do?  

  koji build --scratch dist-rawhide kernel-2.6.31-14.src.rpm

Comment 17 Andrew Jones 2009-10-12 12:37:43 UTC
koji building does work. I can create an rpm with a known-good kernel revision, install it, and then boot the system. Local mock building probably also works, but I didn't wait long enough to find out.

The problem is that local mock building and remote koji building are both too slow for the rapid build test cycle of bisecting, so I'm trying to short-cut the process and just create the bzImage. However, I still can't boot even when I create a bzImage that should be equivalent to the vmlinuz from an rpm created by a working koji build. This indicates my short-cut build procedure is flawed somehow, and therefore I'll never be able to bisect with it since I'll never see a "good" build.

Another thought is that my building might be fine, but the flaw is in the installation (since koji builds are installed with the rpm util).  I don't know how my installation would be flawed though, since as I outlined above, I have a symlink ready in /lib/modules pointing to known-good mods, I've updated grub.conf to boot the bzImage, which of course I copy into /boot for each trial, and I've stolen the initrd grub line that corresponds to the modules I link to.

Yet another hiccup in the bisecting road is that there really aren't that many tags for fedora revs in cvs. Therefore bisecting will really have to be done on the upstream git repo in order to get some granularity. Like I said in comment 15 though, building with this short-cut procedure on the upstream git repo also fails to generate bootable images for me. Furthermore, the bzImage isn't as equivalent to the koji build as it was when building from cvs, since it's missing the fedora patches. We really need a way to apply those patches first, but I tried simply applying them in the kernel.spec order, and it didn't go smoothly: many failed to apply. Sigh...

Comment 18 Chuck Ebbert 2009-10-13 11:47:37 UTC
Mock build doesn't take all that long if you only build the essential packages:

mock -r my-rawhide-x86_64 --with=baseonly --with=firmware --without=debuginfo $1

Comment 19 Andrew Jones 2009-10-13 17:49:34 UTC
Ok, there's still something unexplainable about how koji builds differ from simply doing a make, but thanks to Chris L., I was finally able to boot (and see that I'm booting) my own known-good builds. It appears that if you build manually you need to make sure that you also manually add 'console=tty0 console=hvc0' (the order matters) to the kernel command line.

So being able to trust my own builds and having a starting known-good point, I was finally able to do the bisecting. I would have liked to stay more integrated with fedora, but since fedora doesn't have its own git tree (which makes bisecting a breeze), I really just focused on the stable upstream tree, git://git.kernel.org/pub/scm/linux/kernel/git/stable/linux-2.6-stable.git. I still used the latest rawhide config file.

The bisecting exercise brought me to

commit 93dbda7cbcd70a0bd1a99f39f44a9ccde8ab9040
Author: Jeremy Fitzhardinge <jeremy.fitzhardinge>
Date:   Thu Feb 26 17:35:44 2009 -0800

    x86: add brk allocation for very, very early allocations

so the attached backtrace looks supportive, since it was in __slab_alloc early in the boot.

Jeremy is already on the CC list for this bug.  I'll let him comment to this finding.
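
A bisect like this can be partly automated with git bisect run, which treats exit status 0 as good, 125 as "skip this revision", and any other status from 1 to 127 as bad. The sketch below covers only the classification step, from a captured guest console log to an exit status. The log file, the GOOD_MARKER string and the function name are assumptions for illustration; the oops text is the one from the earlyprintk output in this report.

```shell
#!/bin/sh
# Hypothetical marker that a known-good build is assumed to print once it has
# booted far enough; replace it with a line your own good kernel emits.
GOOD_MARKER=${GOOD_MARKER:-"Freeing unused kernel memory"}

# classify_boot_log LOGFILE
# Returns 0 (good), 1 (bad) or 125 (skip), matching git bisect run's convention.
classify_boot_log() {
    log=$1
    # No log file at all means the build or guest start never happened: skip.
    [ -e "$log" ] || return 125
    # The oops reported in this bug counts as bad.
    if grep -qF 'BUG: unable to handle kernel NULL pointer dereference' "$log"; then
        return 1
    fi
    # Reaching the marker counts as good.
    if grep -qF -- "$GOOD_MARKER" "$log"; then
        return 0
    fi
    # Anything else, including an empty log, is treated as an early hang.
    return 1
}

# When called with a log file argument, classify it and exit with that status.
if [ -n "${1:-}" ]; then
    classify_boot_log "$1"
    exit $?
fi
```

In a real run, the script handed to git bisect run would first build the bzImage and boot the guest with earlyprintk=xen, capture the console to a file, and only then call the function. An empty log is deliberately classed as bad, because "not a single line of output" is itself one of the failure modes described here.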

Comment 20 Jeremy Fitzhardinge 2009-10-13 19:19:59 UTC
(In reply to comment #19)
> The bisecting exercise brought me to
> 
> commit 93dbda7cbcd70a0bd1a99f39f44a9ccde8ab9040
> Author: Jeremy Fitzhardinge <jeremy.fitzhardinge>
> Date:   Thu Feb 26 17:35:44 2009 -0800
> 
>     x86: add brk allocation for very, very early allocations
> 
> so the attached backtrace looks supportive, since it was in __slab_alloc early
> in the boot.
> 
> Jeremy is already on the CC list for this bug.  I'll let him comment to this
> finding.  

That's very interesting.  But odd.  This change has been in mainline for just on 8 months now with no reported problems, so I'm assuming that there's some other interaction going on here that's affected by this patch.

Judging from the boot log, it doesn't look like anything even made any brk allocations (which isn't unexpected).

So my thought is that there's been a conflict with some other Fedora kernel patch; quite likely in the vmlinux_*.lds.S files.

What does "nm -n vmlinux" show?

Comment 21 Andrew Jones 2009-10-14 14:43:36 UTC
I'm not sure this code has been overly tested even though it's 8 months old. The problem is only seen with 32-on-64, and this patch wasn't merged in until 2.6.30. 2.6.30 didn't get distributed with f11 until later in its life.

I can consistently reproduce the boot hang when moving from the commit immediately before it to this commit. I'm not on a Fedora patched kernel, as I'm using the stable upstream git repo. I did use a Fedora config file, but I don't change the config in any way when moving between the booting commit and this commit.

The __slab_alloc stuff is probably not important for now, since when I reproduce the hang exactly on this commit, I don't see it, or any backtrace. I just hang, but consistently during the early printing of the early_node_map. I'll attach the output.

I'm not sure what exactly you'd like from nm, but guessing you'd like to see how the patched in region looks, it looks fine to me, exactly 1M brk region added.

bootable:
00000000c100a670 B __bss_stop
00000000c100a670 B _end
00000000c100b000 B pg0

not bootable:
00000000c100a670 B __bss_stop
00000000c100b000 B __brk_base
00000000c110b000 B __brk_limit
00000000c110b000 B _end
00000000c110b000 B pg0
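
As a quick check of the "exactly 1M" reading, the size of the brk region is the difference between the two new symbols. The sketch below does that subtraction in shell arithmetic. The helper name region_size is made up for illustration; the addresses are the ones from the nm output above.

```shell
#!/bin/sh
# region_size BASE LIMIT: print the size in bytes of the range [BASE, LIMIT).
# Hypothetical helper, not part of any kernel or Xen tooling.
region_size() {
    echo $(( $2 - $1 ))
}

# __brk_base and __brk_limit from the non-bootable image.
region_size 0xc100b000 0xc110b000    # prints 1048576 (0x100000, i.e. 1 MB)
```

The same subtraction applied to _end shows that it moved by the same 1 MB, from 0xc100a670 rounded up to the next page in the bootable image to 0xc110b000 in the non-bootable one.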

Comment 22 Andrew Jones 2009-10-14 14:45:28 UTC
Created attachment 364758 [details]
boot hang on bisected commit

Comment 23 Andrew Jones 2009-10-14 15:08:16 UTC
I should also note that I've tried several attempts to revert parts of the patch, to see if I could get a bootable image without reverting all of the patch, but all my attempts failed.  I also saw the following comment in include/xen/interface/xen.h

 *  9. There is guaranteed to be at least 512kB padding after the final
 *     bootstrap element. If necessary, the bootstrap virtual region is
 *     extended by an extra 4MB to ensure this.

and if I'm not mistaken then this hunk of the patch

+       max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
+                                 xen_start_info->nr_pt_frames * PAGE_SIZE +
+                                 512*1024);

is doing the 512kB padding bump. So I suspected that bumping the bss by a MB might cause us to need to do the 4MB extension part as well, but trying this small patch to the commit in question didn't help

-       max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
+       unsigned long max_mapped_size = __pa(xen_start_info->pt_base) +
                                  xen_start_info->nr_pt_frames * PAGE_SIZE +
-                                 512*1024);
+                                 512*1024;
+       if (max_mapped_size > 0x400000)
+               max_mapped_size=(max_mapped_size+0x3fffff)&~0x3fffff;
+       max_pfn_mapped = PFN_DOWN(max_mapped_size);
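
For reference, the rounding in the trial patch above can be checked outside the kernel. This is a minimal shell sketch of the same arithmetic, assuming 4 kB pages (a page shift of 12, as on x86). round_up_4m and pfn_down are illustrative names that mirror the expressions in the hunk; they are not kernel interfaces.

```shell
#!/bin/sh
# round_up_4m SIZE: mirror the trial patch. Sizes above 4 MB are rounded up
# to the next 4 MB boundary; sizes of 4 MB or less are left alone.
round_up_4m() {
    size=$(( $1 ))
    if [ "$size" -gt $(( 0x400000 )) ]; then
        size=$(( (size + 0x3fffff) & ~0x3fffff ))
    fi
    echo "$size"
}

# pfn_down ADDR: page frame number containing ADDR, assuming 4 kB pages.
pfn_down() {
    echo $(( $1 >> 12 ))
}

# One byte past 4 MB is bumped to 8 MB, which is page frame 2048.
pfn_down "$(round_up_4m 0x400001)"
```

Note that the condition is a strict "greater than", so a mapped size of exactly 4 MB is not extended.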

Comment 24 Andrew Jones 2009-10-14 15:44:58 UTC
Ah... I thought I did this trial, but must have fumbled it. Just reverting the most suspect parts of the patch allows booting.  The most suspect are the ones that only apply to 32b.

-#ifdef CONFIG_X86_32
-       init_mm.brk = init_pg_tables_end + PAGE_OFFSET;
-#else
-       init_mm.brk = (unsigned long) &_end;
-#endif
+       init_mm.brk = _brk_end;

and

-       init_pg_tables_start = __pa(pgd);
-       init_pg_tables_end = __pa(pgd) + xen_start_info->nr_pt_frames*PAGE_SIZE;
-       max_pfn_mapped = PFN_DOWN(init_pg_tables_end + 512*1024);
+       max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
+                                 xen_start_info->nr_pt_frames * PAGE_SIZE +
+                                 512*1024);

Reverting those hunks works for the commit. Reverting them for the latest kernel rev isn't possible since init_pg_tables_start is already gone.  It's probably not desired either.

Comment 25 Jeremy Fitzhardinge 2009-10-14 23:12:45 UTC
Interesting.  Those are the kinds of places I'd expect to be problematic...

Hm, but I'd really have expected to see some reports of problems before now.  Oh, well, I'll look into it.

Comment 26 Andrew Jones 2009-10-15 15:27:07 UTC
Chris L. tipped me off that a latest build of the f13 kernel (2.6.32-rc3) booted for him. I tried myself building from the f13 koji src rpm, but it didn't boot for me. However, then I built f12 from the latest cvs repo (2.6.31.4) and it did boot. I tried again with the latest of the upstream kernel, torvalds tree, which is currently somewhere past 2.6.32-rc4, but that didn't boot for me. I think there are probably several interactions, config files, fedora patches, bugs, etc. at play here. It's good to know some 2.6.3[12] kernels are booting though.

Comment 27 Jeremy Fitzhardinge 2009-10-15 18:03:27 UTC
Yes and no.  I'd prefer nice simple symptoms and repro case ;)

Comment 28 Justin M. Forbes 2009-10-21 19:28:14 UTC
I haven't seen a successful 2.6.31.x (f12) boot yet, including the build from yesterday. Since this is a boot option, we really need to get a fix in before F12 freeze so that people can install.  I can revert parts of the offending patch if necessary, but I would rather include what might be used for upstream.

Comment 29 Jeremy Fitzhardinge 2009-10-21 19:53:33 UTC
Can you reproduce this with a plain upstream kernel, or only on a RH kernel?

Comment 30 Justin M. Forbes 2009-10-21 20:26:40 UTC
Per comment 19, the bisect done was on the upstream linus tree, not the Fedora trees. The only thing rh specific used was the config file.

Comment 31 Jesse Keating 2009-10-21 22:25:42 UTC
Running out of time here folks.  Is this something we'd slip the release for?

Comment 32 Jeremy Fitzhardinge 2009-10-21 23:29:46 UTC
I'm looking at it at the moment.  I don't think the fix will be complex once we've identified the cause.

Justin, can you attach the .config?

Comment 33 Andrew Jones 2009-10-22 07:41:57 UTC
Just to restate the current status.  No kernels after 2.6.29.6 that I have tried will boot. I haven't been able to reproduce booting of 2.6.32 kernels as was stated in comment 26. The problem is on both fedora kernels and pure upstream; in fact, all the bisecting I did was on pure upstream. When bisecting, I used the config from the latest rawhide at the time, and then just hit enter when make oldconfig asked me about different options. Usually, though, if I'm building a specific rev I will grab the config that most closely matches in revision number, which typically avoids many make oldconfig prompts.

I'll attach the latest rawhide config, which probably isn't exactly the same as what I used since rawhide changes quickly, but it will be close. I'll also attach the config from 2.6.29.6-217.2.16.fc11.i686.PAE, which is the last reliable fedora kernel rev.

I'm still looking into the weirdness with the 2.6.32 kernels possibly working once... But I think our best clue is really the result of the bisection.

Comment 34 Andrew Jones 2009-10-22 07:54:51 UTC
Created attachment 365664 [details]
fedora config for last bootable kernel

Comment 35 Andrew Jones 2009-10-22 07:56:01 UTC
Created attachment 365665 [details]
current rawhide kernel config

Comment 36 Andrew Jones 2009-10-22 07:57:20 UTC
Created attachment 365666 [details]
latest available fedora kernel config

Comment 37 Andrew Jones 2009-10-22 15:20:04 UTC
Some interesting progress today.  It turns out that the weirdness of the 2.6.32
kernel working once, and then not working later, is reproducible. Note, the
host is RHEL5.4.  Here's what you do.

Try booting a 2.6.32-rc5 kernel with 1024 MB memory allocated to it. That
should work (less than 1G mem, for example 512 MB, may or may not work,
usually not). With this 1G mem config you can boot it over and over. I've even
built the modules and created the initrd, and then booted and ran happily on
it. 

Now, bump up the memory allocation to something above 1024 MB, for example 1536.
That should also work. Then go back down to 1024 MB and you'll see it doesn't
boot. In fact, it won't boot again for any memory configuration less than 1536.

Two ways have been found that will allow you to boot again with less than the
last memory configuration. The first is to reboot the host. The second is to
run another VM, anything, at the same time you attempt to boot this one.

Other notes:
I still haven't been able to boot any 2.6.3x kernels less than 2.6.32-rc[45]
with any configuration of memory. Trying to boot 2.6.32-rc5 on pure upstream
Xen always fails, even with 1024 MB memory.

So it looks like there's a hypervisor bug that is exposed by guest 2.6.3x PAE kernels when running on 64-bit. I recall from Virt Test day that 32-on-32 worked.
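
The step-down sequence above can be written out as commands. This sketch only prints the commands rather than running them, since running them needs a Xen host. It assumes that xm create's name=value override is used to set the memory size; the function name is made up, and the config path and domain name passed in are placeholders.

```shell
#!/bin/sh
# print_stepdown_repro CONFIG DOMAIN
# Print the create/destroy sequence for the 1024 -> 1536 -> 1024 MB trial.
# Reported behaviour on the RHEL 5.4 host with a 2.6.32-rc5 guest: the first
# two boots work, and the final 1024 MB boot does not.
print_stepdown_repro() {
    cfg=$1
    dom=$2
    for mem in 1024 1536 1024; do
        printf 'xm create %s memory=%s\n' "$cfg" "$mem"
        printf 'xm destroy %s\n' "$dom"
    done
}

# Placeholder config path and domain name.
print_stepdown_repro /etc/xen/guest.cfg guest
```

Each create would normally be followed by a look at the guest console before the destroy, to record whether that memory size booted.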

Comment 38 Andrew Jones 2009-10-22 16:04:27 UTC
Hmm, actually the "running another VM at the same time" trick doesn't seem to be so reliable. Only rebooting the host guarantees that you can step back down in memory allocation.

Comment 39 Jeremy Fitzhardinge 2009-10-22 17:20:25 UTC
OK, that's interesting.  It suggests that the problem depends on some state from Xen itself.  The normal way that manifests is because of some dependency on the exact MFNs Xen allocates for the domain, which makes sense given your observations about needing to reboot.

How much memory does the host have?  Does it make a difference to constrain the memory to less than 2 or 4G?

But I don't see anything in that patch which would depend on the values of particular MFNs (there's no manipulation of machine addresses or ptes at all in there).

If the change to max_pfn_mapped is relevant, then it may well depend on the exact size of the kernel, so any code or config change could affect the outcome without actually being specifically relevant.  That will make bisection unreliable and misleading.

However, the fact that reverting specific parts of the patch restores booting is a strong indication of fault, of course.

I'll try building a kernel with your config and see if I can repro the bug.

Comment 40 Jesse Keating 2009-10-22 17:44:21 UTC
If it's a bug in the hypervisor, then it's not something we can "fix" for Fedora 12, nor should we delay Fedora 12 for it.  I'd be very interested in when you can confirm it is in fact a hypervisor bug as opposed to a guest kernel bug.

Comment 41 Jeremy Fitzhardinge 2009-10-22 18:07:32 UTC
I doubt it's a hypervisor bug.

Comment 42 Mark McLoughlin 2009-10-23 13:13:08 UTC
Just to answer Jesse's question

(In reply to comment #31)
> Running out of time here folks.

Agreed. We have 12 days before final composing begins.

> Is this something we'd slip the release for?

IMHO, that's certainly a possibility. However, if it came down to it and we actually had to decide whether to slip or not, I could certainly understand if we changed our minds about whether being installable on Xen is a release blocker.

i.e. we should keep it on the blocker list for now, but I don't think it's a sure fire thing we would actually slip.

Comment 43 Andrew Jones 2009-10-23 16:20:02 UTC
Ok, time for another update to this saga.

2.6.31.5 kernels (both in the F12 tree and the upstream-stable) are booting
fine now. The memory issue (not being able to step back) has even disappeared. 
With respect to Fedora, the last available kernel package,
kernel-PAE-2.6.31.4-88.fc12.i686, didn't work, and the current one,
kernel-PAE-2.6.31.5-91.rc1.fc12.i686, does. So something recently has allowed
us to boot again.

I hope Jeremy doesn't run off just yet though, since moving that one commit
forward on the 2.6.29.6 tree kills the booting, and then reverting suspect
parts brings the booting back. We should understand that to make sure it's not
a bug that has been covered by recent patches rather than fixed.

Also, this working kernel rev (2.6.31.5) may not suffer from the memory issues
found yesterday, since those could have been introduced by patches it doesn't
contain that 2.6.32 kernels do. That's another thing to investigate further and
understand.

We've probably been banging against more than one bug, or other variables when
bouncing around kernel revs and configs, but I think we're finally getting some
progress at narrowing things down.

Comment 44 Justin M. Forbes 2009-10-23 17:26:54 UTC
Just an update, this is not related to the debug builds used for F12.  It appears the F11 2.6.30 kernels have the same problem.

Comment 45 Andrew Jones 2009-10-24 12:05:37 UTC
I drove git some more and here are the results.

1) The previously mentioned commit found by bisecting (we'll call it commit 1) kills the bootability.

2) This commit (commit 2) allows it to boot again

commit 33df4db04a79660150e1948e3296eeb451ac121b
Author: Jeremy Fitzhardinge <jeremy>
Date:   Thu May 7 11:56:44 2009 -0700

    x86: xen, i386: reserve Xen pagetables

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c

+       reserve_early(__pa(xen_start_info->pt_base),
+                     __pa(xen_start_info->pt_base +
+                          xen_start_info->nr_pt_frames * PAGE_SIZE),
+                     "XEN PAGETABLES");
+

---

3) Then with commit 3 the bootability goes away again, but with a new
   symptom. The symptom we see in early_printk is a NULL dereference BUG at
   xen_evtchn_do_upcall+0xc5/0x120, which is probably a propagation of
   something else. Indeed, we see that commit 3 involves the slab allocator,
   which is in the stack first attached to this BZ. That early_printk output
   also had a following NULL dereference BUG in xen_evtchn_do_upcall.

commit 83b519e8b9572c319c8e0c615ee5dd7272856090
Author: Pekka Enberg <penberg.fi>
Date:   Wed Jun 10 19:40:04 2009 +0300

    slab: setup allocators earlier in the boot sequence

    This patch makes kmalloc() available earlier in the boot sequence so we can
    rid of some bootmem allocations. The bulk of the changes are due to
    kmem_cache_init() being called with interrupts disabled which requires some
    changes to allocator boostrap code.

    Note: 32-bit x86 does WP protect test in mem_init() so we must setup traps
    before we call mem_init() during boot as reported by Ingo Molnar:

      We have a hard crash in the WP-protect code:

      [    0.000000] Checking if this processor honours the WP bit even in super
      [    0.000000]      EDI 00000188  ESI 00000ac7  EBP c17eaf9c  ESP c17eaf8c
      [    0.000000]      EBX 000014e0  EDX 0000000e  ECX 01856067  EAX 00000001
      [    0.000000]      err 00000003  EIP c10135b1   CS 00000060  flg 00010002
      [    0.000000] Stack: c17eafa8 c17fd410 c16747bc c17eafc4 c17fd7e5 000011f
      [    0.000000]        00099800 c17bb000 c17eafec c17f1668 000001c5 c17f132
      [    0.000000]        c166e033 c153a014 c18237cc 00020800 c17eaff8 c17f106
      [    0.000000] Pid: 0, comm: swapper Not tainted 2.6.30-tip-02161-g7a74539
      [    0.000000] Call Trace:
      [    0.000000]  [<c15357c2>] ? printk+0x14/0x16
      [    0.000000]  [<c10135b1>] ? do_test_wp_bit+0x19/0x23
      [    0.000000]  [<c17fd410>] ? test_wp_bit+0x26/0x64
      [    0.000000]  [<c17fd7e5>] ? mem_init+0x1ba/0x1d8
      [    0.000000]  [<c17f1668>] ? start_kernel+0x164/0x2f7
      [    0.000000]  [<c17f1322>] ? unknown_bootoption+0x0/0x19c
      [    0.000000]  [<c17f106a>] ? __init_begin+0x6a/0x6f

---

4) Commit 4 brings booting back again on upstream-stable.

commit d560bc61575efae43595cbcb56d0ba3b9450139c
Author: Jeremy Fitzhardinge <jeremy>
Date:   Tue Aug 25 12:53:02 2009 -0700

    x86, xen: Suppress WP test on Xen

    Xen always runs on CPUs which properly support WP enforcement in
    privileged mode, so there's no need to test for it.

    This also works around a crash reported by Arnd Hannemann, though I
    think its just a band-aid for that case.

---

Without digging too deeply it's clear that the commit pairs 1,2 and 3,4 are
related, so things make sense. We need to better understand the band-aid
aspect of 4 though. Especially since Fedora kernels > v2.6.31-rc9 were still
suffering from the check_slab+0x2b/0xba bug until 2.6.31.5.

Comment 46 Andrew Jones 2009-10-24 12:07:42 UTC
Yet another issue with Fedora kernels is that 2.6.31.1-56.fc12.i686.PAE is
the last kernel to show the check_slab+0x2b/0xba problem on boot. Koji pkgs
after that one, and before the first working one (2.6.31.5-91.rc1.fc12.i686.PAE)
have some bootloader problem.

xm create fails on the first following pkg (2.6.31.1-58.fc12.i686.PAE) with
an error message

Using config file "/etc/xen/rawhide-32pv-1".
Error: (1, 'Internal error', 'xc_dom_do_gunzip: inflate failed (rc=-5)\n')

The pkgs following that one show no output at all and xenctx shows we hung
immediately after jumping to the kernel

rip: 00010000
rsp: 00010000
rax: 00010000   rbx: e021c0a05fb0       rcx: 00010000   rdx: 00010000
rsi: 00010000   rdi: 00010000   rbp: 2460000e019
 r8: 00010000    r9: 00010000   r10: e0000000d8 r11: e0210000e021
r12: c0475ffe000e0000   r13: c0a05fc4   r14: c0a4f668c0add2ac   r15: c0aa0901
 cs: 00000000    ds: 00000000    fs: 00000000    gs: 00000000

This can maybe be ignored since it's fixed in the current release, otherwise
it should be addressed by a different bz.

Likewise the "memory step down problem" we see in 2.6.32 kernels will be
addressed in another bug that I will file soon.

Comment 47 Chris Lalancette 2009-10-26 08:16:24 UTC
(In reply to comment #46)
> Yet another issue with Fedora kernels is that 2.6.31.1-56.fc12.i686.PAE is
> the last kernel to show the check_slab+0x2b/0xba problem on boot. Koji pkgs
> after that one, and before the first working one
> (2.6.31.5-91.rc1.fc12.i686.PAE)
> have some bootloader problem.
> 
> xm create fails on the first following pkg (2.6.31.1-58.fc12.i686.PAE) with
> an error message
> 
> Using config file "/etc/xen/rawhide-32pv-1".
> Error: (1, 'Internal error', 'xc_dom_do_gunzip: inflate failed (rc=-5)\n')

This is probably because this particular kernel is LZMA compressed, which the current RHEL-5 userspace can't decompress.  The patches to decompress are in upstream Xen, though, and should be in RHEL-5 shortly.  So this failure is probably not significant to this bug (except that it is preventing more testing!).

Chris Lalancette

Comment 48 Andrew Jones 2009-10-26 09:15:30 UTC
Oops, I thought I already had the lzma patch in my xen user-space. You're right, kernel 2.6.31.1-58.fc12.i686.PAE needed that patch (which I was missing) to boot far enough to show the same symptom as the other kernels before it, i.e. the check_slab+0x2b/0xba BUG.

Even with that patch, the other kernels after 2.6.31.1-58.fc12.i686.PAE still have the same problem though: they hang immediately after jumping to 10000.

That said, I think it's a different issue that should be addressed with a different bug, if we choose to pursue it at all. It seems to be gone with 2.6.31.5 fedora kernels, and it doesn't even exist with upstream-stable kernels, so maybe we don't need to.

Comment 49 Andrew Jones 2009-10-26 09:16:14 UTC
I think the result from comment 45 finally narrows this bug down to a single question: what is wrong with commit 3 that commit 4 "band-aids" in the case of upstream-stable, but fails to band-aid in the case of the fedora kernels based off the same revisions?

Comment 50 Andrew Jones 2009-10-26 14:15:05 UTC
Ok, I didn't take my own advice of ignoring the "jump to 10000" hang. xenctx didn't show anything useful, but I took a look in xm dmesg and there was a stack there.

(XEN) Unhandled page fault in domain 1 on VCPU 0 (ec=0000)
(XEN) Pagetable walk from 00000000000004f8:
(XEN)  L4[0x000] = 0000000134eca027 0000000000002ef6
(XEN)  L3[0x000] = 0000000000000000 ffffffffffffffff
(XEN) domain_crash_sync called from entry.S
(XEN) Domain 1 (vcpu#0) crashed on cpu#3:
(XEN) ----[ Xen-3.1.2  x86_64  debug=n  Not tainted ]----
(XEN) CPU:    3
(XEN) RIP:    e019:[<00000000c0475f69>]
(XEN) RFLAGS: 0000000000000246   CONTEXT: guest
(XEN) rax: 0000000000000000   rbx: 00000000c2ef6000   rcx: 00000000c0a9e901
(XEN) rdx: 0000000000000000   rsi: 00000000c0adb254   rdi: 00000000c0a4d668
(XEN) rbp: 00000000c0a03fc4   rsp: 00000000c0a03fb0   r8:  0000000000000000
(XEN) r9:  0000000000000000   r10: 0000000000000000   r11: 0000000000000000
(XEN) r12: 0000000000000000   r13: 0000000000000000   r14: 0000000000000000
(XEN) r15: 0000000000000000   cr0: 000000008005003b   cr4: 00000000000026b0
(XEN) cr3: 00000001327c2000   cr2: 00000000000004f8
(XEN) ds: e021   es: e021   fs: 00d8   gs: 00e0   ss: e021   cs: e019
(XEN) Guest stack trace from esp=c0a03fb0:
(XEN)   00000000 c0475f69 0001e019 00010046 00000000 c0a03fd0 c04762a2 00000000
(XEN)   c0a03ffc c0a9e901 11400018 c04090b1 00000000 00000000 00000000 00000000
(XEN)   00000000 c2ef3000 00000000 00000000 00000000 00000000 00000000 00000000

The EIP is pointing at 

0xc0475f69 <trace_hardirqs_off_caller+99>:	cmpl   $0x0,0x4f8(%edx)
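
The faulting address can be cross-checked against the Xen dump with a little arithmetic. This is a minimal sketch, assuming the standard x86_64 4-level layout (9 index bits per level, 12-bit page offset) for the hypervisor's pagetable walk; the helper names are made up for illustration.

```python
def effective_address(base: int, displacement: int, bits: int = 32) -> int:
    """Address of a base+displacement operand, wrapped to the register width."""
    return (base + displacement) & ((1 << bits) - 1)


def pagetable_indices(vaddr: int) -> dict:
    """Split a virtual address into x86_64 4-level page-table indices."""
    return {
        "L4": (vaddr >> 39) & 0x1FF,
        "L3": (vaddr >> 30) & 0x1FF,
        "L2": (vaddr >> 21) & 0x1FF,
        "L1": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


edx = 0x0                                # rdx value from the Xen register dump
fault = effective_address(edx, 0x4F8)    # operand 0x4f8(%edx)
print(hex(fault))                        # compare with cr2 in the dump
print(pagetable_indices(fault))          # compare with the L4[...] / L3[...] lines
```

With edx at zero, the operand resolves to the displacement itself, which is the usual signature of a NULL pointer dereference through a structure field.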

Since the hang goes away with 2.6.31.5 kernels, my guess (without bisecting to be sure) is that it was fixed by the following commit, since it relates to both irqs and edx.

commit 33b6563da26335fbe6834b743ddd00fa8f7ab09a
Author: Jeremy Fitzhardinge <jeremy>
Date:   Mon Oct 12 16:32:43 2009 -0700

    x86/paravirt: Use normal calling sequences for irq enable/disable

---

So this makes a total of 4 bugs addressed by this one BZ to get f12 to boot... All are now understood and presumed fixed, except the one pointed to in comment 49 (which this bug should continue to address). I will also be opening the "can't step down memory" bug for 2.6.32 kernels today. And I think there might be yet another issue with booting 2.6.32 kernels, but I need to investigate more to know if it's different than the memory one.

Comment 51 Justin M. Forbes 2009-10-26 17:47:57 UTC
Removing this as a virt blocker.  The boot issues are fixed with F-12 using the latest 2.6.31.5 kernel currently available.  I am not closing the issue because there are other problems listed that still need to be investigated.

Comment 52 Jeremy Fitzhardinge 2009-10-26 18:08:40 UTC
(In reply to comment #45)

Andrew, thanks for putting so much effort into tracking this down.  Now that you remind me, I remember all those fixes; I'd forgotten to consider them with respect to your kernel.  Does this mean that there are patches in mainline which are missing from stable?

> 4) Commit 4 brings booting back again on upstream-stable.
> 
> commit d560bc61575efae43595cbcb56d0ba3b9450139c
> Author: Jeremy Fitzhardinge <jeremy>
> Date:   Tue Aug 25 12:53:02 2009 -0700
> 
>     x86, xen: Suppress WP test on Xen
> 
>     Xen always runs on CPUs which properly support WP enforcement in
>     privileged mode, so there's no need to test for it.
> 
>     This also works around a crash reported by Arnd Hannemann, though I
>     think its just a band-aid for that case.
> 
> ---
> 
> Without digging too deeply it's clear that the commit pairs 1,2 and 3,4 are
> related, so things make sense. We need to better understand the band-aid
> aspect of 4 though. Especially since Fedora kernels > v2.6.31-rc9 were still
> suffering from the check_slab+0x2b/0xba bug until 2.6.31.5.  

My working theory on this one is that for some reason the WP test is causing interrupts to get enabled prematurely, which causes crashes in interrupt handling (which was Arnd's failure mode).  What are your symptoms between 3-4?

Comment 53 Andrew Jones 2009-10-27 14:07:48 UTC
> with respect to your kernel.  Does this mean that there are patches in mainline
> which are missing from stable?

No, I think Linus' tree and the master of stable are equivalent. This patch (commit 4) should also be in the fedora 2.6.31.1 based kernels since it's been upstream since 2.6.31-rc9.

> My working theory on this one is that for some reason the WP test is causing
> interrupts to get enabled prematurely which causes crashes in interrupt
> handling (which was the Arnd's failure-mode).  What are your symptoms between
> 3-4?  

The symptom is the check_slab+0x2b/0xba bug seen in the early_printk console output, which was the same symptom seen on upstream stable until commit 4. So what's interesting is that the fedora kernels still have the symptom even when using kernels based off 2.6.31.1. I can take a closer look at their differences. What I'm wondering though is if commit 4 is only hiding a bug with commit 3 in some cases, but not all, which is why we still see it with the fedora kernels.

Comment 54 Andrew Jones 2009-10-27 17:36:01 UTC
I've opened bug 531311 for the 2.6.32 booting issue.

Comment 55 Paolo Bonzini 2009-10-27 17:58:32 UTC
While doing an unrelated bisection, I spotted this:

Checking if this processor honours the WP bit... BUG: unable to handle kernel NULL poi

This leaves only a dozen commits or so:

good: v2.6.31.5 064a16dc 3530c188 5ce00289 e2984cbf e9d59922 fa877c71
bad: 162bc7ab01

Comment 56 Bug Zapper 2009-11-16 12:32:56 UTC
This bug appears to have been reported against 'rawhide' during the Fedora 12 development cycle.
Changing version to '12'.

More information and reason for this action is here:
http://fedoraproject.org/wiki/BugZappers/HouseKeeping

Comment 57 Andrew Jones 2010-02-23 18:19:29 UTC
We've gotten enough runtime on F12 that the final question this BZ was still open for is now somewhat irrelevant. The question (in comment 45) was whether "commit 4" fixes the problem introduced by "commit 3". Closing this bug as current release.

