Note: This bug is displayed in read-only format because the product is no longer active in Red Hat Bugzilla.
This bug belongs to the Red Hat Enterprise Linux 5 product line; the current stable release is 5.10. For Red Hat Enterprise Linux 6 and above, please visit Red Hat JIRA (https://issues.redhat.com/secure/CreateIssue!default.jspa?pid=12332745) to report new issues.

Bug 464043

Summary: Huge page backed guest on exit via signal faults kvm.
Product: Red Hat Enterprise Linux 5
Reporter: john cooper <john.cooper>
Component: xen
Assignee: john cooper <john.cooper>
Status: CLOSED UPSTREAM
QA Contact: Virtualization Bugs <virt-bugs>
Severity: high
Docs Contact:
Priority: high
Version: 5.4
CC: chrisw, john.cooper, mtosatti, nobody, xen-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2009-02-05 03:09:10 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Attachments:
tarball of patches, logs of failure and correction (Flags: none)

Description john cooper 2008-09-26 01:29:36 UTC
Created attachment 317751 [details]
tarball of patches, logs of failure and correction.

Description of problem:

Launching a huge page backed qemu guest and then terminating it with
a signal causes access to freed heap memory.  This was noticed on a
host kernel configured with DEBUG_SLAB, which poisons freed slab
objects and so exposes the case where freed heap memory is
dereferenced as a pointer.

Version-Release number of selected component (if applicable):

Believed to be largely agnostic of kvm release and host kernel
version.  Specifically, however, I've reproduced it under kernel
2.6.25.7 and kvm-73, which happened to be handy.

How reproducible:
100%.

Steps to Reproduce:

Launch qemu on a host kernel configured with DEBUG_SLAB, let
the guest start to boot, and nail it with a SIGINT.
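
The steps above can be sketched as a shell recipe.  This is a
hedged reconstruction, not the reporter's exact command line: the
mount point, huge page count, memory size, and disk image name are
placeholders, and it assumes the kvm-73 qemu's -mem-path option for
hugetlbfs-backed guest memory.

```shell
# Placeholder setup: mount hugetlbfs and reserve some huge pages.
mkdir -p /hugepages
mount -t hugetlbfs none /hugepages
echo 512 > /proc/sys/vm/nr_hugepages

# Launch a huge page backed guest (guest.img is a placeholder image).
qemu-system-x86_64 -m 512 -mem-path /hugepages guest.img &

sleep 10        # let the guest start to boot
kill -INT $!    # SIGINT drives the faulty exit path
```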
  
Actual results:

Guest exit processing causes a general protection fault via
dereference of a bogus pointer.  The guest was also found to be
stuck in exit processing, but this may have been a side
effect of debugging.

Expected results:

A signal terminated guest should cause no data corruption
when freeing kvm resources, should cause no kernel diagnostics,
and should exit cleanly.


Additional info:

The crux of the problem is due to kvm maintaining struct
page references for SPTEs which it attempts to free in this
error scenario after the huge page backed file has already
been dismantled.

The in-memory inode for the huge page backed file contains
a struct address_space to which related page structures
point, including the copies maintained by kvm.  During
exit processing the huge page file is dismantled via
truncation of the file, deletion of mappings, and finally
free of the inode structure in __do_exit() -> close_files().

In the failure scenario the close of the open /dev/kvm file
occurs after that of the huge page file, so the free of the
page structures references the previously freed inode
structure via now-stale pointers:

close of huge page file:

-----------------------------------------------------------------------
hugetlb_put_quota: as 7b919c68 nrp 0 host 7b919b58 delta 1 fblks ffffffff
hugetlb_put_quota: as 7b919c68 nrp 0 host 7b919b58 delta 0 fblks ffffffff
Pid: 5305, comm: qemu-system-x86 Not tainted 2.6.25.7 #14
Sep 17 09:36:05 crash kernel:  
Call Trace:
 [<ffffffff8030a94e>] hugetlb_put_quota+0x50/0x6e
 [<ffffffff80275bb1>] hugetlb_unreserve_pages+0xd6/0xef
 [<ffffffff8030ab2d>] truncate_hugepages+0x18a/0x19c
 [<ffffffff8030ab3f>] ? hugetlbfs_delete_inode+0x0/0x41
 [<ffffffff8030ab74>] hugetlbfs_delete_inode+0x35/0x41
 [<ffffffff80295b9e>] generic_delete_inode+0x73/0xe7
 [<ffffffff8030abb7>] hugetlbfs_drop_inode+0x37/0x17c
 [<ffffffff802952b2>] iput+0x7c/0x80
 [<ffffffff802932b0>] dentry_iput+0x8a/0x9a
 [<ffffffff8029335e>] d_kill+0x21/0x42
 [<ffffffff8029432d>] dput+0xd3/0xdf
 [<ffffffff80284c02>] __fput+0x151/0x175
 [<ffffffff80284dcb>] fput+0x14/0x16
 [<ffffffff802822b5>] filp_close+0x66/0x71
 [<ffffffff80231232>] put_files_struct+0x6d/0xc1
 [<ffffffff802312c1>] __exit_files+0x3b/0x40
 [<ffffffff80232478>] do_exit+0x246/0x659
 [<ffffffff80232906>] do_group_exit+0x7b/0x96
 [<ffffffff8023a5a1>] get_signal_to_deliver+0x2de/0x30b
 [<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
 [<ffffffff8020a443>] do_notify_resume+0xbd/0x878
 [<ffffffff8024985d>] ? do_futex+0xb5/0xa49
 [<ffffffff80382801>] ? rb_insert_color+0xb9/0xe3
 [<ffffffff80243227>] ? enqueue_hrtimer+0x64/0x6d
 [<ffffffff802437e1>] ? hrtimer_start+0x117/0x129
 [<ffffffff802908ac>] ? sys_select+0x11a/0x17b
 [<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
 [<ffffffff8020b3e7>] ptregscall_common+0x67/0xb0

hugetlbfs_destroy_inode.1: i 7b919b58 map 7b919c68
hugetlbfs_destroy_inode.2: i 7b919b58 map 6b6b6b6b  <-- inode overwritten on heap
-----------------------------------------------------------------------

Exception in kvm_vcpu_release():

-----------------------------------------------------------------------
general protection fault: 0000 [1] SMP
CPU 0
Modules linked in: kvm_intel kvm [last unloaded: kvm]
Pid: 5305, comm: qemu-system-x86 Not tainted 2.6.25.7 #14
RIP: 0010:[<ffffffff8030a920>]  [<ffffffff8030a920>] hugetlb_put_quota+0x22/0x6e
RSP: 0018:ffff8100721139d8  EFLAGS: 00010286
RAX: ffffffff80911860 RBX: 0000000000000000 RCX: 6b6b6b6b6b6b6b6b
RDX: ffff81007b919c68 RSI: ffffffff80603290 RDI: ffff81007b919c68
RBP: ffff8100721139f8 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000001
R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000001
R13: ffff81007b919c68 R14: ffff810071dfc000 R15: ffff810071d58000
FS:  0000000000000000(0000) GS:ffffffff807eb000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 5305, threadinfo ffff810072112000, task ffff81006b576040)
Stack:  ffff8100721139f8 ffffffff8027560c 0000000000000000 ffffe20000421000
 ffff810072113a28 ffffffff802756e0 ffff810072113a38 ffffffff8027561e
 ffffe20000421000 ffff81007a8bc6e0 ffff810072113a48 ffffffff8026208c
Call Trace:
 [<ffffffff8027560c>] ? enqueue_huge_page+0x2a/0x3c
 [<ffffffff802756e0>] free_huge_page+0xc2/0xca
 [<ffffffff8027561e>] ? free_huge_page+0x0/0xca
 [<ffffffff8026208c>] put_compound_page+0x61/0x80
 [<ffffffff802626ce>] put_page+0x4d/0xf5
 [<ffffffff88008e2e>] ? :kvm:kvm_mmu_zap_page+0x1f8/0x262
 [<ffffffff88000c81>] :kvm:kvm_release_pfn_clean+0x41/0x64
 [<ffffffff88000cbd>] :kvm:kvm_release_pfn_dirty+0x19/0x1d
 [<ffffffff88008aa0>] :kvm:rmap_remove+0x85/0x194
 [<ffffffff88008c88>] :kvm:kvm_mmu_zap_page+0x52/0x262
 [<ffffffff88009379>] :kvm:free_mmu_pages+0x1a/0x45
 [<ffffffff880093c3>] :kvm:kvm_mmu_destroy+0x1f/0x65
 [<ffffffff88002c09>] :kvm:kvm_arch_vcpu_uninit+0x25/0x44
 [<ffffffff880018dc>] :kvm:kvm_vcpu_uninit+0x11/0x21
 [<ffffffff88024b79>] :kvm_intel:vmx_free_vcpu+0x78/0x8b
 [<ffffffff88002852>] :kvm:kvm_arch_vcpu_free+0xe/0x10
 [<ffffffff88002a67>] :kvm:kvm_arch_destroy_vm+0x103/0x154
 [<ffffffff88001478>] :kvm:kvm_put_kvm+0x6d/0x87
 [<ffffffff880018b3>] :kvm:kvm_vcpu_release+0x13/0x17
 [<ffffffff80284b6a>] __fput+0xb9/0x175
 [<ffffffff80284dcb>] fput+0x14/0x16
 [<ffffffff802822b5>] filp_close+0x66/0x71
 [<ffffffff80231232>] put_files_struct+0x6d/0xc1
 [<ffffffff802312c1>] __exit_files+0x3b/0x40
 [<ffffffff80232478>] do_exit+0x246/0x659
 [<ffffffff80232906>] do_group_exit+0x7b/0x96
 [<ffffffff8023a5a1>] get_signal_to_deliver+0x2de/0x30b
 [<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
 [<ffffffff8020a443>] do_notify_resume+0xbd/0x878
 [<ffffffff8024985d>] ? do_futex+0xb5/0xa49
 [<ffffffff80382801>] ? rb_insert_color+0xb9/0xe3
 [<ffffffff80243227>] ? enqueue_hrtimer+0x64/0x6d
 [<ffffffff802437e1>] ? hrtimer_start+0x117/0x129
 [<ffffffff802908ac>] ? sys_select+0x11a/0x17b
 [<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
 [<ffffffff8020b3e7>] ptregscall_common+0x67/0xb0
-----------------------------------------------------------------------

In the simplest case we need only cause the close of /dev/kvm
to occur before the tear-down of the huge page file.  I tried
to force this by flagging the struct file corresponding to
/dev/kvm, poking O_SYNC into f_flags at open() time from
libkvm/libkvm.c:kvm_init(), and making kernel/exit.c:close_files()
take two passes over the open files: a first pass closing all
struct file(s) flagged via f_flags & O_SYNC, and a second pass
closing the balance.  This didn't pan out, as for some reason
kvm_vcpu_release() was not entered ahead of the huge file
tear-down as expected.  So I punted and added a shameless hook
in __do_exit() to allow an even uglier hack: a wrapper invoking
kvm_vcpu_release() before getting into __exit_files().  This
appears to remedy the sequencing issue, preventing the reference
to the freed huge page file's inode.

Patches are attached which are intended only as documentation of
the above: a rough validation of the problem's cause and of the
approach suggested for a solution.

Comment 2 Marcelo Tosatti 2009-02-05 03:09:10 UTC
This is fixed upstream: SPTEs pointing to compound pages are released
before the inode is freed, and -mem-path is disabled when mmu
notifiers are not available.