Bug 607650

Summary: KVM uses wrong permissions for large guest pages
Product: Red Hat Enterprise Linux 6
Reporter: Martin Banas <mbanas>
Component: kernel
Assignee: Karen Noel <knoel>
Status: CLOSED CURRENTRELEASE
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 6.0
CC: aarcange, amit.shah, bmarson, ddumas, dmalcolm, ebenes, emcnabb, ghacker, hdegoede, jarod, jbrier, jclift, jcm, jokajak, jpirko, jstodola, justin, kchamart, knoel, lihuang, liko, llim, lwang, lwoodman, maier, mbanas, michen, mishu, msauton, mtosatti, mvadkert, pholica, qcai, qwan, riel, rjones, shalli.vcgfdt, shuang, sushil.singh, syeghiay, tao, tburke
Target Milestone: rc
Target Release: ---
Hardware: x86_64
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
: 615225 (view as bug list)
Environment:
Last Closed: 2010-11-11 15:44:09 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:    
Bug Blocks: 582286, 599016, 615225    
Attachments (Description / Flags):
Screenshot of the crashed out anaconda. (none)
screenshot of rhel6 guest console showing python exception (none)
/tmp from inside rhel6 rc4/rc2 guest after python exception (none)

Description Martin Banas 2010-06-24 14:13:36 UTC
Created attachment 426588 [details]
logs from /tmp.

Description of problem:
Installation fails intermittently (we can't reproduce it reliably) in KVM while packages are being installed. Sometimes it crashes just after the package group is selected, sometimes later.

I can't tell whether it happens only in KVM; it may be the same on bare metal.

Version-Release number of selected component (if applicable):
RHEL6.0-20100622.1, x86_64
anaconda-13.21.50-9.el6

How reproducible:
sometimes, without any useful log files

Steps to Reproduce:
1. Start RHEL6 installation
2. Proceed to stage2 (select any installation source)
3. Leave all options default, proceed to package selection
4. Just click Next and wait to see whether anaconda crashes.
  
Actual results:
Installer crashes

Expected results:
anaconda should be able to finish the installation every time.

Additional info:

Comment 1 RHEL Program Management 2010-06-24 14:33:06 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 2 Ales Kozumplik 2010-06-24 16:20:41 UTC
I have no clue; let's debug it using the steps below:

1) use updates=http://akozumpl.fedorapeople.org/bz607650.img. It should give us a little more information.
2) use nokill option on the kernel command line.
3) do not reboot the machine.
4) call me over after a crash so we can take a look at the installation machine while it's still running.
5) try to secure the files in /mnt/sysimage/root. They should contain logs that will tell us which package is currently being installed---maybe it's one of them.
6) BTW, is it possible that one of the QE tools running (or the QE engineer) sent anaconda the SIGUSR1 signal?

Thanks.
Ales

Comment 3 Martin Banas 2010-06-25 05:03:46 UTC
Hi Ales,

I'm going to use the updates.img today; I'll let you know when I have something. Yes, yesterday before calling you we sent the SIGUSR1 signal to anaconda to get the tb_ logs :))

Comment 4 Martin Banas 2010-06-25 06:06:11 UTC
Created attachment 426773 [details]
traceback screenshot

I got traceback while using updates.img. Can you provide another updates.img?

Thanks

Comment 5 Ales Kozumplik 2010-06-25 09:45:54 UTC
New update posted at the same location.

Comment 6 Ales Kozumplik 2010-06-25 10:24:39 UTC
We've seen this kind of crash so far in:
* enablefilesystems
* installpackages
* right at the start of anaconda after udevadm is called

Comment 7 Ales Kozumplik 2010-06-28 11:01:40 UTC
With anaconda always the same (that is, the top of the current beta branch), I can reproduce this when I use the kernel+images from:
* releng 0622.1 (kernel 2.6.32-37)
* releng 0621.0 (kernel 2.6.32-37)
* releng 0617.0 (kernel 2.6.32-36)

I have not been able to reproduce this with:
* releng 0603.1 (kernel 2.6.32-33)
* nightly 0610.n (kernel 2.6.32-33)

Once again: the stage2 anaconda version is always the same and we haven't changed anything in stage1 since:

commit c58efd1e9d0971ad1ddd155be4fc930006af7a5c
Author: Chris Lumens <clumens>
Date:   Fri May 28 15:57:11 2010 -0400

Comment 8 Ales Kozumplik 2010-06-29 09:07:00 UTC
Also reproducible on:
* nightly 0628.n.3 (kernel 2.6.32-37)

Comment 9 Ales Kozumplik 2010-06-29 09:07:54 UTC
I discovered yesterday that the crashes happen because a routine deep down in Python C code (even glibc possibly) calls abort() in the anaconda process upon seeing a corrupted memory structure.

Comment 10 Ales Kozumplik 2010-06-29 11:28:45 UTC
Note: I don't know of anyone (the Brno RTT or myself) who has seen this on
anything else except x86_64 qemu-kvm virtual machine.

Also see bug 609071.

Comment 11 Ales Kozumplik 2010-06-29 13:07:21 UTC
All the kvm virtual machines passed 1-pass memtest without any errors.

So did the kvm host.

Comment 12 Steffen Maier 2010-06-29 14:13:06 UTC
(In reply to comment #6)
> We've seen this kind of crash so far in:
> * enablefilesystems
> * installpackages
> * right at the start of anaconda after udevadm is called    

It might not be the same kind of crash, however, Ales' patch to add a dump on anaconda crash might also help debugging bug 525804 and is thus a good thing.

Comment 13 Andrea Arcangeli 2010-06-29 14:44:16 UTC
Is there any swap in use, especially on the host? Do you run KSM?

Thanks!

Comment 14 Ales Kozumplik 2010-06-29 15:02:41 UTC
Hi Andrea,

The host:
[root@cobra03 ~]# cat /proc/swaps 
Filename				Type		Size	Used	Priority
/dev/dm-1                               partition	6160376	13100	-1

[root@cobra03 ~]# cat /sys/kernel/mm/ksm/pages_sharing 
246676

IOW: we didn't set anything special; it's a rhel6 machine with qemu-kvm version 0.12.1.2.

The guest:
In anaconda, we don't use swap from the start, but it is used once the disks are mounted. At the moment most of the crashes appear, the disks are already mounted; I have seen it crash before that, though.

Comment 15 Ales Kozumplik 2010-06-29 15:25:00 UTC
*** Bug 609071 has been marked as a duplicate of this bug. ***

Comment 16 Andrea Arcangeli 2010-06-29 17:03:21 UTC
For now I'd like you to test this kernel (ideally both as host and guest, but you can start by testing it on the host), to find out whether the problem goes away with it (given that it wasn't reproducible with
transparent hugepage disabled).

If swapping wasn't happening (at least on the host), I doubt it will help though.

http://brewweb.devel.redhat.com/brew/taskinfo?taskID=2560476

Comment 18 Ales Kozumplik 2010-06-29 17:15:05 UTC
(In reply to comment #16)
> For now I'd like you to test this kernel (ideally both as host and guest, but
> you can start testing it on the host), to know if the problem goes away
> with it (as long as it wasn't reproducible with
> transparent hugepage disabled).

Andrea,

I didn't try with hugepage disabled. How can that be set, and where---host or guest?

Thanks.
Ales

Comment 19 Ales Kozumplik 2010-06-29 17:15:58 UTC
oh wait, it's been moved to kernel component. sorry for changing the status.

Comment 20 Ales Kozumplik 2010-06-29 17:19:49 UTC
also in that case, I guess the kernel QE can take care of any additional testing.

Comment 21 Andrea Arcangeli 2010-06-29 17:22:22 UTC
You did effectively try with transparent hugepage disabled, because you tested on older kernels and it worked: transparent hugepage has been enabled only in more recent kernels.

Can you just boot the kernel at the above link on the host and see if it happens again? Hopefully it won't take too much time to be sure it's not reproducible anymore!

Comment 22 Ales Kozumplik 2010-06-30 07:56:21 UTC
Hi Andrea,

Our kvm server is a production machine used by my entire team, and I can't ad hoc install a new (experimental) kernel on it and reboot it.

Ales

Comment 23 Hans de Goede 2010-06-30 08:42:23 UTC
Andrea,

To be clear, the various kernel versions mentioned in comment #7 were tested on the guest side. No changes were made on the host side during these tests.

So to be sure: do you believe that enabling transparent huge page support inside the guest can cause problems on the host side? Or was there a misunderstanding, and you thought we were testing with different kernels on the host side?

Regards,

Hans

Comment 25 Ales Kozumplik 2010-06-30 12:10:36 UTC
I just tried and I am unable to reproduce this with kernel 2.6.32-35.el6.

Comment 26 Ales Kozumplik 2010-06-30 13:58:40 UTC
This could also be useful: while testing the 0630.n.0 compose I found that the best way to reproduce this is using the text installer.

Comment 27 Hans de Goede 2010-06-30 15:26:29 UTC
*** Bug 606700 has been marked as a duplicate of this bug. ***

Comment 30 Andrea Arcangeli 2010-06-30 23:14:07 UTC
to comment #22 and comment #23:

Then we can test that kernel in the guest. But if this is guest-only, I have a hard time seeing how it could be related to that patch. Still, if that patch is hiding a race condition in the VM that doesn't only trigger with the KSM copy, it's worth testing. THP enabled triggers code paths in the VM that wouldn't normally trigger without THP, but those paths are unrelated to the THP support itself. So it's worth testing the rpm in the guest, but I'm not optimistic.

If EPT is enabled on host and there are no transparent hugepages on host, it's unlikely to be a bug in the host.

Comment 31 Andrea Arcangeli 2010-06-30 23:17:22 UTC
And I assume you don't get any kernel errors in the guest (kernel logs)---or you would have mentioned them already---and you only get that python error in userland.

Comment 32 Andrea Arcangeli 2010-06-30 23:30:24 UTC
the tmp.tar.gz didn't include /tmp/updates/iutil.py

the error in this bug says TRANSLATION_UPDATE_DIR isn't defined, which is a much saner error than what is, for example, posted here:

https://bugzilla.redhat.com/attachment.cgi?id=427650

Looking at the logs, it seems there is no kernel error in the guest, just these userland failures.

Comment 33 Ales Kozumplik 2010-07-01 06:48:38 UTC
Andrea,

the logs attached in the bug description are a bit misleading:
09:39:22,327 DEBUG   : X server has signalled a successful start.
09:39:22,329 ERROR   : Error running /usr/bin/metacity: Interrupted system call

Those two lines only appear because the QA engineer who first encountered the problem sent anaconda USR1 in an attempt to extract a traceback (what he really meant was USR2, hence the error lines).

Normally, anaconda crashes without displaying any error at all; only the anaconda init process reports that the termination was abnormal. I added code to a more recent anaconda that also displays which signal killed it (if any) and, more importantly, dumps core. Upon inspecting the coredump I discovered that anaconda is abort()ed by a glibc routine handling a malloc().

If you are interested in seeing the dump, then get in touch with me (the core dump is 84 MB and you'll need the right debug symbols; I can help you with that or give you access to my testing machine).

Ales

Comment 36 Ales Kozumplik 2010-07-01 13:28:39 UTC
Thanks for sharing the link. Does the new kernel need to be tested on both the host and the guest?

Comment 37 Andrea Arcangeli 2010-07-01 13:33:36 UTC
If cat /sys/module/kvm_intel/parameters/ept returns Y on the host, then it's hard to tell how hugepages in the guest could trigger bugs in KVM.

If EPT is off, then yes, you should try the latest rhel6 KVM on the host too. If EPT is on, it should be enough to test taskID=2569345 in the guest, considering the bug doesn't trigger with transparent hugepage off.

Comment 38 Ales Kozumplik 2010-07-02 09:37:35 UTC
Good news,

I did several installs with kernel-2.6.32-41.el6transhuge on both the kvm host and the guest, and I am fairly confident the problem is gone!

Comment 39 Martin Banas 2010-07-08 07:33:05 UTC
Hello, I hit the bug again: Anaconda died after receiving signal 6.

I was installing RHEL6.0-20100707.4, which has kernel-2.6.32-44. The host was RHEL5.5 with kernel 2.6.18-194.

Comment 40 Miroslav Vadkerti 2010-07-08 09:31:43 UTC
Same thing here, host fedora 12, tested compose the same as in comment #39

Comment 41 Andrea Arcangeli 2010-07-08 13:06:45 UTC
So I've been told on irc that host crashed too, and that ept is N.

So this sounds like a bug in kvm in RHEL 5.5 in dealing with shadow pagetables changing size.

Comment 42 Avi Kivity 2010-07-08 13:23:47 UTC
Likely a host kvm mmu bug.

Comment 43 Andrea Arcangeli 2010-07-08 16:06:28 UTC
can you verify that all systems where this happened (either corruption in guest or rhel5 host crash) where ept=0 systems?

Comment 44 Andrea Arcangeli 2010-07-08 16:06:51 UTC
s/where/were/

Comment 45 Ales Kozumplik 2010-07-08 16:31:29 UTC
(In reply to comment #43)
> can you verify that all systems where this happened (either corruption in guest
> or rhel5 host crash) where ept=0 systems?    

Martin,

you can check that by doing:

cat /sys/module/kvm_intel/parameters/ept

Thanks.
Ales

Comment 47 Avi Kivity 2010-07-10 06:12:27 UTC
Likely 372f84cecff2af0c5a14ebaef9563b1a2e2acfdb in kvm.git.

Comment 48 Avi Kivity 2010-07-10 06:24:17 UTC
Even more likely, 3be2264b.

Comment 49 Avi Kivity 2010-07-10 06:28:12 UTC
No, 3be2264b is only suspect if the host uses large pages as well.

Comment 50 Avi Kivity 2010-07-10 06:41:12 UTC
How much memory and cpus do you assign to a guest to reproduce this?

Comment 51 Miroslav Vadkerti 2010-07-10 18:21:10 UTC
I always use 1 CPU and 700-1000MB of RAM

Comment 52 Avi Kivity 2010-07-11 11:43:09 UTC
For those testing on RHEL 6 beta hosts, please try https://brewweb.devel.redhat.com/taskinfo?taskID=2587538.

Others, place your orders here.

Comment 53 Martin Banas 2010-07-12 10:07:28 UTC
Hi Ales,

There's no such file in RHEL5 host:
/sys/module/kvm_intel/parameters/ept

I also reproduced on my notebook with Fedora 12, where ept=n.

I always assign 1 CPU and 1GB of memory to the guest.

Comment 54 Chris Lumens 2010-07-12 13:32:48 UTC
*** Bug 613320 has been marked as a duplicate of this bug. ***

Comment 55 David Cantrell 2010-07-12 20:11:48 UTC
*** Bug 610261 has been marked as a duplicate of this bug. ***

Comment 56 Hans de Goede 2010-07-12 20:19:52 UTC
*** Bug 610255 has been marked as a duplicate of this bug. ***

Comment 57 Justin Clift 2010-07-14 13:49:28 UTC
Avi, do you want testing done on F13 too?

This problem occurs with RHEL 6 beta 2 in KVM on my F13 desktop, so it's easy to test.

Comment 58 Avi Kivity 2010-07-14 14:18:23 UTC
Yes.  Do you need me to build a test kernel (please say no)?

Comment 59 Justin Clift 2010-07-14 15:13:59 UTC
Thanks Avi (and Andrea).  Just tried your kernel from comment #52 on the F13 system here, using this ISO on KVM locally, and it still crashes out Anaconda:

  ftp://ftp.redhat.com/pub/redhat/rhel/beta/6Server-beta2/x86_64/iso/RHEL6.0-20100622.1-Server-x86_64-DVD1.iso

  $ cat /sys/module/kvm_intel/parameters/ept
  N
  $ rpm -qa | grep kvm
  qemu-kvm-0.12.3-8.fc13.x86_64
  $

The VM in question was allocated 1024MB of RAM and 2 virtual CPUs. Physically, it's running on a dual-core box (E3300) with 4GB of RAM.

Comment 60 Justin Clift 2010-07-14 15:15:16 UTC
I can do a screencast of the whole process if you want, using something like RecordMyDesktop?  (it's pretty simple)

Comment 61 Justin Clift 2010-07-14 15:28:24 UTC
Created attachment 431813 [details]
Screenshot of the crashed out anaconda.

Comment 62 Avi Kivity 2010-07-15 12:53:08 UTC
I now have a reliable reproducer (on F13 host).  Hacked khugepaged/scan_sleep_millisecs = 1 in guest initrd, crash is immediate.  Yay!
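The khugepaged tweak above can be scripted rather than hacked into the initrd. A minimal sketch in Python (assumptions: the standard transparent-hugepage sysfs path below, run as root inside the guest; the base parameter is only there so the helper can be exercised outside a real guest):

```python
import os

# Standard THP khugepaged sysfs directory (assumes a THP-enabled kernel).
KHUGEPAGED = "/sys/kernel/mm/transparent_hugepage/khugepaged"

def set_scan_sleep_millisecs(ms, base=KHUGEPAGED):
    """Make khugepaged wake up every `ms` milliseconds.

    A very small value (1, or even 0) makes it scan almost continuously,
    so 4k -> 2M page collapses happen immediately and the crash window
    described above is hit right away. Returns the value read back.
    """
    path = os.path.join(base, "scan_sleep_millisecs")
    with open(path, "w") as f:
        f.write(str(ms))
    with open(path) as f:
        return int(f.read().strip())
```

This is only a convenience wrapper around the sysfs write; the original reproducer simply echoed the value into the file.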

Comment 63 Avi Kivity 2010-07-15 12:59:29 UTC
bad: 2.6.33.6-147.fc13.x86_64

Comment 64 Andrea Arcangeli 2010-07-15 13:44:07 UTC
Awesome news from comment #62! And you can set it to 0 too, so it won't even schedule outside of cond_resched.

In addition to letting us reproduce fast, this is a great hint, as it means we only have to focus on the change from 4k to 2M at the same guest virtual address.

NOTE: see mm/huge_memory.c:collapse_huge_page() and search for pmdp_clear_flush_notify. I am only setting the (regular) pmd to NULL and then flushing the TLB (the TLB flush for the pmd isn't done with invlpg, as there was some errata; instead I do a safer cr3 overwrite inside pmdp_clear_flush_notify, with IPIs to all cpus that have the active_mm), and I'm not touching the ptes! Yet writing zero into the pmd must drop all underlying ptes too; I guess that is what may be going wrong. Later (after writing zero into the regular pmd and writing cr3 back to itself with IPIs to the relevant cpus---not necessarily the current one, as this runs from a kernel thread) I simply write the new hugepmd value, with PSE set, into the pmdp.

So the only thing that can be going wrong is that writing zero into the pmd and flushing the TLB must also get rid of all the underlying 4k sptes.

Comment 65 Avi Kivity 2010-07-15 14:09:46 UTC
good: kvm.git next (cb7eaecb3389c7fa2490ea1bee8f10cfa5df30d4)

Comment 66 Avi Kivity 2010-07-15 14:24:48 UTC
bad: 2.6.35-rc5+ (2f7989e)

Comment 67 Avi Kivity 2010-07-15 14:35:01 UTC
good: kvm.git 2b2e379

Comment 68 Avi Kivity 2010-07-15 15:02:24 UTC
indeterminate: kvm.git 83e2e42 (probable unrelated kernel issue)

Comment 69 Avi Kivity 2010-07-15 15:14:17 UTC
bad: kvm.git cda5dcb

Comment 70 Avi Kivity 2010-07-15 15:36:40 UTC
06f334e2b509b4c9f6c4cec7e0e56444a2730922 is the first good commit
commit 06f334e2b509b4c9f6c4cec7e0e56444a2730922
Author: Xiao Guangrong <xiaoguangrong.com>
Date:   Wed Jun 30 16:02:45 2010 +0800

    KVM: MMU: fix conflict access permissions in direct sp
    
    In no-direct mapping, we mark sp is 'direct' when we mapping the
    guest's larger page, but its access is encoded form upper page-struct
    entire not include the last mapping, it will cause access conflict.
    
    For example, have this mapping:
            [W]
          / PDE1 -> |---|
      P[W]          |   | LPA
          \ PDE2 -> |---|
            [R]
    
    P have two children, PDE1 and PDE2, both PDE1 and PDE2 mapping the
    same lage page(LPA). The P's access is WR, PDE1's access is WR,
    PDE2's access is RO(just consider read-write permissions here)
    
    When guest access PDE1, we will create a direct sp for LPA, the sp's
    access is from P, is W, then we will mark the ptes is W in this sp.
    
    Then, guest access PDE2, we will find LPA's shadow page, is the same as
    PDE's, and mark the ptes is RO.
    
    So, if guest access PDE1, the incorrect #PF is occured.
    
    Fixed by encode the last mapping access into direct shadow page
    
    Signed-off-by: Xiao Guangrong <xiaoguangrong.com>
    Signed-off-by: Marcelo Tosatti <mtosatti>

Bisect log: (inverted, good=bad and vice versa)

# bad: [8dea5648467102184c65d61cf2be6e0fbfa41060] KVM: VMX: fix tlb flush with invalid root
# good: [83e2e428db2c9f40c52f3f7764feec974e322183] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux-2.6
git bisect start 'HEAD' '83e2e42' 'arch/x86/kvm'
# good: [a63e16c655f9e68d49d6fae4275ffda16b1888b2] KVM: Prevent internal slots from being COWed
git bisect good a63e16c655f9e68d49d6fae4275ffda16b1888b2
# good: [a0a7ccde2fe285f4cbb71eeab3c9b7f7bb68231a] KVM: VMX: Execute WBINVD to keep data consistency with assigned devices
git bisect good a0a7ccde2fe285f4cbb71eeab3c9b7f7bb68231a
# bad: [372f84cecff2af0c5a14ebaef9563b1a2e2acfdb] KVM: MMU: fix forgot to flush all vcpu's tlb
git bisect bad 372f84cecff2af0c5a14ebaef9563b1a2e2acfdb
# bad: [06f334e2b509b4c9f6c4cec7e0e56444a2730922] KVM: MMU: fix conflict access permissions in direct sp
git bisect bad 06f334e2b509b4c9f6c4cec7e0e56444a2730922
# good: [52403eac7dadaef462954e0a680149d5d8536fac] KVM: MMU: fix writable sync sp mapping
git bisect good 52403eac7dadaef462954e0a680149d5d8536fac

Comment 71 Avi Kivity 2010-07-15 15:41:56 UTC
Andrea, does THP support PROT_READ pages?  That is, will it set a pmd with the writeable bit clear?

Comment 72 Avi Kivity 2010-07-15 15:46:20 UTC
Patch fixes upstream, so looking good.

Comment 73 Andrea Arcangeli 2010-07-15 15:54:11 UTC
Yes, THP supports write-protected shared anon pages with only PROT_READ set. It's identical to regular anon pages, but huge. But shared anon hugepages are only generated by fork, or if they're set up with mprotect/mmap without PROT_WRITE.

Comment 74 Avi Kivity 2010-07-15 16:37:35 UTC
Ok, that explains how the patch fixes the issue (could also be kvm picking up an existing kernel mapping for the new user mapping).

Comment 75 Avi Kivity 2010-07-15 17:03:04 UTC
https://brewweb.devel.redhat.com/taskinfo?taskID=2602160

Doesn't end for some reason.

Comment 76 Andrea Arcangeli 2010-07-15 17:09:20 UTC
To comment #74: I guess khugepaged was only needed to create more anon hugepages, so that fork would more often generate the readonly anon hugepages. Otherwise khugepaged would only work on writable ptes and create writable huge pmds.

I'm not sure I understand how kvm could pick a kernel-mapping huge pmd for the user mapping; even if both sptes point to the same host physical address, they should always be at different guest virtual addresses. fork would instead generate the same guest virtual address. But they would all be readonly if they all pointed to the same host physical address.

Comment 77 Avi Kivity 2010-07-15 17:30:18 UTC
There are multiple failure modes for this:

1. guest kernel maps lowmem using huge pages
2. guest kernel touches page
3. kvm instantiates direct map for huge page with kernel mode access, since the guest huge page is mapped to host small pages
4. guest kernel maps same page to userspace using huge page
5. guest userspace touches page
6. kvm uses existing direct map from step 3 instead of generating a new one
7. guest userspace retries touching the page, #PF because it has kernel permissions

The alternative scenario is a read-only map at step 3 reused for a rw mapping in step 6 (leading to #PF), or a rw mapping at step 3 reused for a ro mapping in step 6 (corruption, what fun).
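The reuse bug in steps 3-6 can be illustrated with a toy model. This is purely illustrative Python, not real KVM code (ShadowPageCache and its methods are made-up names): a direct-sp cache keyed only by guest frame number keeps the first mapping's permissions for every later mapping, which is what the fix bisected in comment 70 addresses by encoding the last mapping's access into the direct shadow page.

```python
class ShadowPageCache:
    """Toy model of KVM's direct shadow-page lookup."""

    def __init__(self, key_includes_access):
        self.key_includes_access = key_includes_access
        self.cache = {}

    def get_direct_sp(self, gfn, access):
        # Before the fix: the lookup key is only the guest frame number,
        # so the first mapping's permissions stick to every later mapping
        # of the same large page. After the fix: the access bits are part
        # of the key, so W and RO mappings get distinct shadow pages.
        key = (gfn, access) if self.key_includes_access else gfn
        if key not in self.cache:
            self.cache[key] = access  # instantiate sp with this access
        return self.cache[key]

# Guest maps the same large page once writable (step 3, kernel mode) and
# once read-only (step 6, userspace), as in the failure modes above.
buggy = ShadowPageCache(key_includes_access=False)
buggy.get_direct_sp(0x1000, "W")
stale = buggy.get_direct_sp(0x1000, "RO")   # returns "W": wrong permissions

fixed = ShadowPageCache(key_includes_access=True)
fixed.get_direct_sp(0x1000, "W")
ok = fixed.get_direct_sp(0x1000, "RO")      # returns "RO": correct
```

The model only captures the keying mistake, not the page-table walk itself, but it shows why the same physical large page could end up with the permissions of whichever mapping touched it first.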

Comment 78 Avi Kivity 2010-07-15 17:33:47 UTC
oh, direct maps are not indexed by virtual address but by guest physical address.

(and indirect maps are not indexed by virtual address, instead they are indexed by the guest physical address of the page table used to map)

Comment 79 Avi Kivity 2010-07-15 18:14:58 UTC
I could have translated the source to optimized machine code by hand faster, but the build is complete:

  https://brewweb.devel.redhat.com/taskinfo?taskID=2602160

Please test (RHEL 6 host).

Comment 80 Justin Clift 2010-07-15 20:35:10 UTC
Good news.  That new kernel build in brew seems a lot better.

On my Fedora 13 workstation, the RHEL 6 beta 2 DVD now installs without issue.

Also installed the kernel on a RHEL 6 beta 2 host itself (with EPT=N), and then installed the RHEL 6 beta 2 DVD multiple times in that.  No problems at all.

On the Fedora side of things, any idea how long until the fix makes its way to the public?

Wondering if we should put this kernel on a testing page (e.g. someone.fedorapeople.org/kernels/) for people to get early access to test it?

Specifically thinking of the guy from BZ #610911 here:

  https://bugzilla.redhat.com/show_bug.cgi?id=610911

Comment 81 Avi Kivity 2010-07-16 08:21:59 UTC
I put an F13 kernel on http://people.redhat.com/akivity/. However, I get a silly welcome page instead of a directory listing; perhaps it's a cache thing.

Comment 82 Avi Kivity 2010-07-16 08:29:51 UTC
Also: shell.devel.redhat.com:~akivity/kernel-2.6.33.6-147.avi.fc13.x86_64.rpm

Comment 83 Justin Clift 2010-07-16 10:42:13 UTC
Thanks Avi.

The people.redhat.com server gave me the welcome page greeting too, but shell.devel.redhat.com worked better.

I tried the package locally here, but couldn't get X running without the matching -devel & -headers packages (needed to recompile the nVidia drivers locally), so I wasn't able to test it.

Would you be able to put the matching -devel & -headers packages on shell.devel.redhat.com? I'll then test it here, and copy the packages externally for the guy to test with.

Comment 84 Avi Kivity 2010-07-16 10:57:09 UTC
-devel and -headers now on shell.devel.

Comment 85 Andrea Arcangeli 2010-07-16 11:41:48 UTC
Who's going to backport this to rhel5? Do we need a new bug for that?

Comment 86 Justin Clift 2010-07-16 11:46:50 UTC
BZ #615225 is a clone of this, for RHEL 5:

  https://bugzilla.redhat.com/show_bug.cgi?id=615225

Comment 87 Justin Clift 2010-07-16 12:30:29 UTC
Thanks Avi.  They're available publicly here for people now:

  http://justinclift.fedorapeople.org/bz610911/

And updated BZ #610911 to point Scott at them.

Comment 88 Hans de Goede 2010-07-20 14:28:00 UTC
*** Bug 616454 has been marked as a duplicate of this bug. ***

Comment 89 Miya Chen 2010-07-22 12:12:19 UTC
1. Tried to install a rhel6 guest 8 times with host kernel kernel-2.6.32-44.2.el6 and transparent hugepage on; new crash was found during guest installation

steps:
1) Install guest with tree RHEL6.0-20100622.1:
# /usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 4G -smp 4 -uuid `uuidgen` -monitor stdio -rtc base=localtime -usbdevice tablet -drive file=test.qcow2,if=none,format=qcow2,werror=stop,rerror=stop,id=drive-virtio0-0-0,boot=on,cache=none -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virtio0-0-0 -netdev tap,id=hostnet0,vhost=on -device virtio-net-pci,netdev=hostnet0,id=net0,mac=20:20:20:56:42:19 -cpu qemu64,+x2apic -vnc :10 -boot n
2) at the anaconda install wizard, select 'Basic-Server' and customize it by adding "Desktop"

2. Loaded the guest with transparent hugepage on for 1h; no crash was found:
for ((;;))
do
dd if=/dev/urandom of=/test bs=1M count=6000
rm -rf /test
done

num=$(grep processor /proc/cpuinfo | tail -n1 | awk '{print $NF}')
for cpu in $(seq 0 $num)
do
taskset -c $cpu yes >/dev/null &
done

Comment 90 Dor Laor 2010-07-22 12:24:53 UTC
(In reply to comment #89)
> 1. Tried to install rhel6 guest for 8 times with with host kernel as
> kernel-2.6.32-44.2.el6 and transparent hugepage is on, new crash was found
> during guest installation 

Where is the crash info?

Comment 93 Dor Laor 2010-07-22 20:28:55 UTC
*** Bug 612525 has been marked as a duplicate of this bug. ***

Comment 94 Miya Chen 2010-07-23 02:05:49 UTC
(In reply to comment #89)
> 1. Tried to install rhel6 guest for 8 times with with host kernel as
> kernel-2.6.32-44.2.el6 and transparent hugepage is on, new crash was found
> during guest installation 
> 
> steps:
> 1) Install guest with tree RHEL6.0-20100622.1:
> # /usr/libexec/qemu-kvm -M rhel6.0.0 -enable-kvm -m 4G -smp 4 -uuid `uuidgen`
> -monitor stdio -rtc base=localtime -usbdevice tablet -drive
> file=test.qcow2,if=none,format=qcow2,werror=stop,rerror=stop,id=drive-virtio0-0-0,boot=on,cache=none
> -device virtio-blk-pci,drive=drive-virtio0-0-0,id=virtio0-0-0 -netdev
> tap,id=hostnet0,vhost=on -device
> virtio-net-pci,netdev=hostnet0,id=net0,mac=20:20:20:56:42:19 -cpu
> qemu64,+x2apic -vnc :10 -boot n
> 2) at the anaconda install wizard, select 'Basic-Server' and customize it by
> adding "Desktop"
> 
> 2. load guest with transparent huge is on for 1h, no crash was found
> for ((;;)) 
> do 
> dd if=/dev/uramdom of=/test bs=1M count=6000
> rm -rf /test
> done
> 
> num=$processor /proc/cpuinfo | tail -n1 | awk '{print $NF}')
> for cpu in $(seq 0 $num)
> do
> taskset -c $cpu yes >/dev/null &
> done    

sorry, in the first scenario, it should be "no crash was found"

Comment 95 Avi Kivity 2010-07-23 08:17:42 UTC
Whew.

Comment 96 Bill Burns 2010-07-23 11:27:26 UTC
+1

Comment 99 Aristeu Rozanski 2010-07-26 14:38:47 UTC
Patch(es) available on kernel-2.6.32-52.el6

Comment 101 Aristeu Rozanski 2010-07-26 15:16:29 UTC
Patch(es) available on kernel-2.6.32-52.el6

Comment 103 Dave Malcolm 2010-07-26 15:40:59 UTC
*** Bug 612853 has been marked as a duplicate of this bug. ***

Comment 104 Hans de Goede 2010-07-26 16:24:02 UTC
*** Bug 618227 has been marked as a duplicate of this bug. ***

Comment 108 Jarod Wilson 2010-07-27 05:45:18 UTC
I've got a rhel6 kvm guest that does a lot of mock builds. It hadn't managed to get through populating a mock chroot while running 2.6.32-52.el6 until just now, after I set transparent hugepages to 'never'. So methinks there's still a buglet somewhere with transparent hugepage support in guests.

Comment 110 Avi Kivity 2010-07-27 08:06:37 UTC
(In reply to comment #108)
> I've got a rhel6 kvm guest that does a lot of mock builds. It hasn't managed to
> get through populating a mock chroot while running 2.6.32-52.el6 until just
> now, after I set transparent hugepages to 'never'. So methinks there's still a
> buglet somewhere with transparent hugepage support in guests.    


What's on your host?  To fix the bug, 2.6.32-52.el6 needs to be on the host, not the guest.  (If the host is Fedora, try http://people.redhat.com/akivity/kernel-2.6.33.6-147.avi.fc13.x86_64.rpm as the host kernel).

Comment 112 Jarod Wilson 2010-07-27 15:06:39 UTC
(In reply to comment #110)
> (In reply to comment #108)
> > I've got a rhel6 kvm guest that does a lot of mock builds. It hasn't managed to
> > get through populating a mock chroot while running 2.6.32-52.el6 until just
> > now, after I set transparent hugepages to 'never'. So methinks there's still a
> > buglet somewhere with transparent hugepage support in guests.    
> 
> 
> What's on your host?  To fix the bug, 2.6.32-52.el6 needs to be on the host,
> not the guest.  (If the host is Fedora, try
> http://people.redhat.com/akivity/kernel-2.6.33.6-147.avi.fc13.x86_64.rpm as the
> host kernel).    

Ah, I didn't realize that. The host is indeed Fedora, but Fedora 12, kernel 2.6.32.16-141.fc12.x86_64. Can you post the patch you added to that F13 kernel somewhere? For this particular system, I'd prefer to just patch on top of the latest F12 kernel for now.

Comment 113 Jarod Wilson 2010-07-27 19:49:11 UTC
Never mind, found it. Patch added to 2.6.32.16-153.fc12. Chuck is adding a few more things to the f12 tree, and will then tag and build for us laggards not yet on f13 (or 14). ;)

Comment 114 Dor Laor 2010-07-28 13:01:20 UTC
*** Bug 596517 has been marked as a duplicate of this bug. ***

Comment 115 Jarod Wilson 2010-07-28 14:11:50 UTC
A local 2.6.32.16-153.fc12 build on my Fedora 12 host with a rhel6 guest, transparent hugepages re-enabled in the guest, and things do indeed finally seem to be stable: it made it through multiple mock builds last night without incident.

Comment 116 Amit Shah 2010-07-29 04:36:44 UTC
*** Bug 617204 has been marked as a duplicate of this bug. ***

Comment 117 Justin Clift 2010-08-02 07:17:50 UTC
*** Bug 610227 has been marked as a duplicate of this bug. ***

Comment 118 Panu Matilainen 2010-08-09 06:37:28 UTC
*** Bug 615102 has been marked as a duplicate of this bug. ***

Comment 119 Panu Matilainen 2010-08-09 08:36:57 UTC
*** Bug 619017 has been marked as a duplicate of this bug. ***

Comment 122 Dave Malcolm 2010-09-17 20:15:16 UTC
*** Bug 613917 has been marked as a duplicate of this bug. ***

Comment 123 Dave Malcolm 2010-09-17 20:44:26 UTC
*** Bug 612627 has been marked as a duplicate of this bug. ***

Comment 125 Dave Malcolm 2010-10-20 15:36:19 UTC
*** Bug 626279 has been marked as a duplicate of this bug. ***

Comment 126 John Brier 2010-10-23 16:03:34 UTC
I think I'm hitting this on RHEV; can someone help me confirm? Should we expect this with this version of RHEL 6:

http://download.devel.redhat.com/rel-eng/RHEL6.0-RC-4/6.0/Server/x86_64/os/

Hypervisor is 

[root@rhevh-4 ~]# uname -a
Linux rhevh-4.gsslab.rdu.redhat.com 2.6.18-194.3.1.el5 #1 SMP Sun May 2 04:17:42 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux

[root@rhevh-4 ~]# cat /etc/redhat-release 
Red Hat Enterprise Virtualization Hypervisor release 5.5-2.2 (4.2)

qemu-kvm process

 9419 ?        Sl    43:17 /usr/libexec/qemu-kvm -no-hpet -no-kvm-pit-reinjection -usbdevice tablet -rtc-td-hack -startdate 2010-10-23T13:15:46 -name rhel6 -smp 1,cores=1 -k en-us -m 1024 -boot nc -net nic,vlan=1,macaddr=00:1a:4a:0a:39:0d,model=virtio -net tap,vlan=1,ifname=virtio_12_1,script=no -drive file=/rhev/data-center/b2252e5b-70b9-428c-bd5e-474008b44982/7f888454-f103-4af3-b3ea-29e027c9d638/images/619fecbc-0b63-4fa8-834c-a741953f1865/ce4d01a4-04cc-498c-970b-41d200451226,media=disk,if=virtio,cache=off,serial=a8-834c-a741953f1865,boot=on,format=raw,werror=stop -pidfile /var/vdsm/50b70f81-35b9-4df6-a23d-5628d983ee83.pid -soundhw ac97 -spice sslpassword=,sslciphersuite=DEFAULT,sslcert=/var/vdsm/ts/certs/vdsmcert.pem,sslkey=/var/vdsm/ts/keys/vdsmkey.pem,ssldhfile=/var/vdsm/ts/keys/dh.pem,sslcafile=/var/vdsm/ts/certs/cacert.pem,host=0,secure-channels=main+inputs,ic=on,sport=5888,port=5912 -qxl 1 -cpu qemu64,+sse2,+cx16,+ssse3,+sse4.1 -M rhel5.5.0 -notify all -balloon none -smbios type=1,manufacturer=Red Hat,product=RHEV Hypervisor,version=5.5-2.2-4.2,serial=FF282989-953E-36C5-80A6-7CB9E0653068_00:1a:64:21:74:6a,uuid=50b70f81-35b9-4df6-a23d-5628d983ee83 -vmchannel di:0200,unix:/var/vdsm/50b70f81-35b9-4df6-a23d-5628d983ee83.guest.socket,server -monitor unix:/var/vdsm/50b70f81-35b9-4df6-a23d-5628d983ee83.monitor.socket,server

I'm attaching a screenshot of the error from the guest console and a tarball of /tmp from inside the guest

Comment 127 John Brier 2010-10-23 16:06:12 UTC
Created attachment 455260 [details]
screenshot of rhel6 guest console showing python exception

Comment 128 John Brier 2010-10-23 16:07:51 UTC
Created attachment 455261 [details]
/tmp from inside rhel6 rc4/rc2 guest after python exception

Comment 129 Justin Clift 2010-10-23 16:22:34 UTC
That does look similar to the bug as it first cropped up on RHEL 6 beta 2 hosts with RHEL 6 beta 2 guests.

Comment 130 Justin Clift 2010-10-23 16:31:09 UTC
As a thought, that server is running an old kernel from the RHEL 5.5 series:

  Linux rhevh-4.gsslab.rdu.redhat.com 2.6.18-194.3.1.el5 #1 SMP ...

The latest is 2.6.18-194.17.1:

  Linux localhost.localdomain 2.6.18-194.17.1.el5 #1 SMP ...

Looking at the release date of the older kernel, it was from before this bug
was known about and fixed.

Are you able to update the host server's packages?

Comment 132 James Antill 2010-11-01 13:17:59 UTC
*** Bug 629671 has been marked as a duplicate of this bug. ***

Comment 133 releng-rhel@redhat.com 2010-11-11 15:44:09 UTC
Red Hat Enterprise Linux 6.0 is now available and should resolve
the problem described in this bug report. This report is therefore being closed
with a resolution of CURRENTRELEASE. You may reopen this bug report if the
solution does not work for you.