Bug 464043
| Summary: | Huge page backed guest on exit via signal faults kvm. | ||||||
|---|---|---|---|---|---|---|---|
| Product: | Red Hat Enterprise Linux 5 | Reporter: | john cooper <john.cooper> | ||||
| Component: | xen | Assignee: | john cooper <john.cooper> | ||||
| Status: | CLOSED UPSTREAM | QA Contact: | Virtualization Bugs <virt-bugs> | ||||
| Severity: | high | Docs Contact: | |||||
| Priority: | high | ||||||
| Version: | 5.4 | CC: | chrisw, john.cooper, mtosatti, nobody, xen-maint | ||||
| Target Milestone: | rc | ||||||
| Target Release: | --- | ||||||
| Hardware: | All | ||||||
| OS: | Linux | ||||||
| Whiteboard: | |||||||
| Fixed In Version: | Doc Type: | Bug Fix | |||||
| Doc Text: | Story Points: | --- | |||||
| Clone Of: | Environment: | ||||||
| Last Closed: | 2009-02-05 03:09:10 UTC | Type: | --- | ||||
| Regression: | --- | Mount Type: | --- | ||||
| Documentation: | --- | CRM: | |||||
| Verified Versions: | Category: | --- | |||||
| oVirt Team: | --- | RHEL 7.3 requirements from Atomic Host: | |||||
| Cloudforms Team: | --- | Target Upstream Version: | |||||
| Embargoed: | |||||||
This is fixed in upstream: sptes pointing to compound pages are released before the inode is freed, and -mem-path is disabled without mmu notifiers.
Created attachment 317751
tarball of patches, logs of failure and correction.

Description of problem:
Launching a huge page backed qemu guest and then terminating it with a signal causes access to invalid heap memory. This was noticed on a host kernel configured with DEBUG_SLAB, which just happens to jostle the heap such that freed heap memory is referenced as a pointer.

Version-Release number of selected component (if applicable):
Believed to be fairly kvm-xx and host kernel agnostic. However, I've specifically reproduced it under 2.6.25.7 and kvm-73, which happened to be handy.

How reproducible / Steps to Reproduce:
Launch qemu on a host kernel configured with DEBUG_SLAB, let the guest start to boot, and nail it with a SIGINT. 100% reproducible.

Actual results:
Guest exit processing causes a kernel protection fault via dereference of a bogus pointer. The guest was also found to be stuck in exit processing, but this may have been a side effect of debugging.

Expected results:
A signal-terminated guest should cause no data corruption when freeing kvm resources, should cause no kernel diagnostics, and should exit cleanly.

Additional info:
The crux of the problem is that kvm maintains struct page references for SPTEs, which it attempts to free in this error scenario after the huge page backed file has already been dismantled. The in-memory inode for the huge page backed file contains a struct address_space to which the related page structures point, including the copies maintained by kvm. During exit processing the huge page file is dismantled via truncation of the file, deletion of mappings, and finally free of the inode structure in __do_exit() -> close_files().
In the failure scenario, the close of /dev/kvm occurs after the close of the huge page file. Closing /dev/kvm frees the page structures, which still reference the previously freed inode structure via now-stale pointers:

close of huge page file:
-----------------------------------------------------------------------
hugetlb_put_quota: as 7b919c68 nrp 0 host 7b919b58 delta 1 fblks ffffffff
hugetlb_put_quota: as 7b919c68 nrp 0 host 7b919b58 delta 0 fblks ffffffff
Pid: 5305, comm: qemu-system-x86 Not tainted 2.6.25.7 #14
Sep 17 09:36:05 crash kernel:
Call Trace:
[<ffffffff8030a94e>] hugetlb_put_quota+0x50/0x6e
[<ffffffff80275bb1>] hugetlb_unreserve_pages+0xd6/0xef
[<ffffffff8030ab2d>] truncate_hugepages+0x18a/0x19c
[<ffffffff8030ab3f>] ? hugetlbfs_delete_inode+0x0/0x41
[<ffffffff8030ab74>] hugetlbfs_delete_inode+0x35/0x41
[<ffffffff80295b9e>] generic_delete_inode+0x73/0xe7
[<ffffffff8030abb7>] hugetlbfs_drop_inode+0x37/0x17c
[<ffffffff802952b2>] iput+0x7c/0x80
[<ffffffff802932b0>] dentry_iput+0x8a/0x9a
[<ffffffff8029335e>] d_kill+0x21/0x42
[<ffffffff8029432d>] dput+0xd3/0xdf
[<ffffffff80284c02>] __fput+0x151/0x175
[<ffffffff80284dcb>] fput+0x14/0x16
[<ffffffff802822b5>] filp_close+0x66/0x71
[<ffffffff80231232>] put_files_struct+0x6d/0xc1
[<ffffffff802312c1>] __exit_files+0x3b/0x40
[<ffffffff80232478>] do_exit+0x246/0x659
[<ffffffff80232906>] do_group_exit+0x7b/0x96
[<ffffffff8023a5a1>] get_signal_to_deliver+0x2de/0x30b
[<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
[<ffffffff8020a443>] do_notify_resume+0xbd/0x878
[<ffffffff8024985d>] ? do_futex+0xb5/0xa49
[<ffffffff80382801>] ? rb_insert_color+0xb9/0xe3
[<ffffffff80243227>] ? enqueue_hrtimer+0x64/0x6d
[<ffffffff802437e1>] ? hrtimer_start+0x117/0x129
[<ffffffff802908ac>] ? sys_select+0x11a/0x17b
[<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
[<ffffffff8020b3e7>] ptregscall_common+0x67/0xb0
hugetlbfs_destroy_inode.1: i 7b919b58 map 7b919c68
hugetlbfs_destroy_inode.2: i 7b919b58 map 6b6b6b6b <-- inode overwritten on heap
-----------------------------------------------------------------------

Exception in kvm_vcpu_release():
-----------------------------------------------------------------------
general protection fault: 0000 [1] SMP
CPU 0
Modules linked in: kvm_intel kvm [last unloaded: kvm]
Pid: 5305, comm: qemu-system-x86 Not tainted 2.6.25.7 #14
RIP: 0010:[<ffffffff8030a920>] [<ffffffff8030a920>] hugetlb_put_quota+0x22/0x6e
RSP: 0018:ffff8100721139d8 EFLAGS: 00010286
RAX: ffffffff80911860 RBX: 0000000000000000 RCX: 6b6b6b6b6b6b6b6b
RDX: ffff81007b919c68 RSI: ffffffff80603290 RDI: ffff81007b919c68
RBP: ffff8100721139f8 R08: 6b6b6b6b6b6b6b6b R09: 0000000000000001
R10: 0000000000000010 R11: 0000000000000000 R12: 0000000000000001
R13: ffff81007b919c68 R14: ffff810071dfc000 R15: ffff810071d58000
FS: 0000000000000000(0000) GS:ffffffff807eb000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000000201000 CR4: 00000000000026e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process qemu-system-x86 (pid: 5305, threadinfo ffff810072112000, task ffff81006b576040)
Stack: ffff8100721139f8 ffffffff8027560c 0000000000000000 ffffe20000421000
ffff810072113a28 ffffffff802756e0 ffff810072113a38 ffffffff8027561e
ffffe20000421000 ffff81007a8bc6e0 ffff810072113a48 ffffffff8026208c
Call Trace:
[<ffffffff8027560c>] ? enqueue_huge_page+0x2a/0x3c
[<ffffffff802756e0>] free_huge_page+0xc2/0xca
[<ffffffff8027561e>] ? free_huge_page+0x0/0xca
[<ffffffff8026208c>] put_compound_page+0x61/0x80
[<ffffffff802626ce>] put_page+0x4d/0xf5
[<ffffffff88008e2e>] ? :kvm:kvm_mmu_zap_page+0x1f8/0x262
[<ffffffff88000c81>] :kvm:kvm_release_pfn_clean+0x41/0x64
[<ffffffff88000cbd>] :kvm:kvm_release_pfn_dirty+0x19/0x1d
[<ffffffff88008aa0>] :kvm:rmap_remove+0x85/0x194
[<ffffffff88008c88>] :kvm:kvm_mmu_zap_page+0x52/0x262
[<ffffffff88009379>] :kvm:free_mmu_pages+0x1a/0x45
[<ffffffff880093c3>] :kvm:kvm_mmu_destroy+0x1f/0x65
[<ffffffff88002c09>] :kvm:kvm_arch_vcpu_uninit+0x25/0x44
[<ffffffff880018dc>] :kvm:kvm_vcpu_uninit+0x11/0x21
[<ffffffff88024b79>] :kvm_intel:vmx_free_vcpu+0x78/0x8b
[<ffffffff88002852>] :kvm:kvm_arch_vcpu_free+0xe/0x10
[<ffffffff88002a67>] :kvm:kvm_arch_destroy_vm+0x103/0x154
[<ffffffff88001478>] :kvm:kvm_put_kvm+0x6d/0x87
[<ffffffff880018b3>] :kvm:kvm_vcpu_release+0x13/0x17
[<ffffffff80284b6a>] __fput+0xb9/0x175
[<ffffffff80284dcb>] fput+0x14/0x16
[<ffffffff802822b5>] filp_close+0x66/0x71
[<ffffffff80231232>] put_files_struct+0x6d/0xc1
[<ffffffff802312c1>] __exit_files+0x3b/0x40
[<ffffffff80232478>] do_exit+0x246/0x659
[<ffffffff80232906>] do_group_exit+0x7b/0x96
[<ffffffff8023a5a1>] get_signal_to_deliver+0x2de/0x30b
[<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
[<ffffffff8020a443>] do_notify_resume+0xbd/0x878
[<ffffffff8024985d>] ? do_futex+0xb5/0xa49
[<ffffffff80382801>] ? rb_insert_color+0xb9/0xe3
[<ffffffff80243227>] ? enqueue_hrtimer+0x64/0x6d
[<ffffffff802437e1>] ? hrtimer_start+0x117/0x129
[<ffffffff802908ac>] ? sys_select+0x11a/0x17b
[<ffffffff8020b154>] ? sysret_signal+0x1c/0x27
[<ffffffff8020b3e7>] ptregscall_common+0x67/0xb0
-----------------------------------------------------------------------

In the simplest case we need only cause the close on /dev/kvm to occur before the tear-down of the huge page file.
I tried to force this ordering by flagging the struct file corresponding to /dev/kvm, poking O_SYNC into f_flags at the time of open() from libkvm/libkvm.c:kvm_init(), and forcing two passes over the open files in kernel/exit.c:close_files(): a first pass to close every struct file flagged via f_flags & O_SYNC, and a second pass to close the balance. This didn't pan out, as for some reason I wasn't getting into kvm_vcpu_release() ahead of the huge page file tear-down as expected.

So I punted and added a shameless hook in __do_exit() to allow an even uglier hack: calling a wrapper to kvm_vcpu_release() before getting into __exit_files(). This appears to remedy the sequencing issue, preventing the reference to the freed huge page file's inode.

The attached patches are intended only as documentation of the above, rough validation of the problem cause, and the approach suggested for a solution.
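For reference, the intended effect of the two-pass close can be sketched in user space as follows. This is a hypothetical model, not the attached patch: `OpenFile`, `PRIORITY_FLAG` and `close_order()` are names made up for illustration, and the flag value is arbitrary (the report reused O_SYNC in f_flags as the marker). It shows only the ordering the two passes were meant to produce; as noted above, the real attempt did not reach kvm_vcpu_release() early as expected.

```python
from dataclasses import dataclass

PRIORITY_FLAG = 0x1000  # hypothetical marker bit standing in for O_SYNC


@dataclass
class OpenFile:
    name: str
    f_flags: int = 0


def close_order(files, priority_flag=PRIORITY_FLAG):
    """Return file names in the order a two-pass close would release them."""
    # Pass 1: files carrying the marker flag are closed first.
    first = [f for f in files if f.f_flags & priority_flag]
    # Pass 2: everything else, in its original table order.
    rest = [f for f in files if not f.f_flags & priority_flag]
    return [f.name for f in first + rest]


fd_table = [
    OpenFile("hugepage-file"),             # placeholder name for the backing file
    OpenFile("/dev/kvm", PRIORITY_FLAG),   # flagged at open() time
    OpenFile("disk-image"),                # placeholder for any other open file
]
print(close_order(fd_table))
```

Partitioning by flag keeps the original order within each pass, so only the flagged descriptor moves ahead of the rest.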