Bug 619268 - rmmod kvm modules cause host kernel panic
Summary: rmmod kvm modules cause host kernel panic
Keywords:
Status: CLOSED ERRATA
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kvm
Version: 5.6
Hardware: All
OS: Linux
Priority: low
Severity: medium
Target Milestone: rc
Target Release: ---
Assignee: Eduardo Habkost
QA Contact: Virtualization Bugs
URL:
Whiteboard:
Depends On:
Blocks: Rhel5KvmTier1
Reported: 2010-07-29 06:22 UTC by Suqin Huang
Modified: 2013-01-09 22:57 UTC (History)

Fixed In Version: kvm-83-199.el5
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2011-01-13 23:37:07 UTC
Target Upstream Version:
Embargoed:


Links
System: Red Hat Product Errata
ID: RHSA-2011:0028
Private: 0
Priority: normal
Status: SHIPPED_LIVE
Summary: Low: kvm security and bug fix update
Last Updated: 2011-01-13 11:03:39 UTC

Description Suqin Huang 2010-07-29 06:22:16 UTC
Description of problem:


Version-Release number of selected component (if applicable):
kvm-83-191.el5

How reproducible:
Sometimes

Steps to Reproduce:
1. Start kvm_stat, then repeatedly load and unload the kvm modules:

modprobe  ksm
modprobe  kvm_amd
modprobe  kvm
modprobe -r ksm
modprobe -r kvm_amd
modprobe -r kvm

Actual results:
host kernel panic

Expected results:


Additional info:
1. host kernel:
2.6.18-209.el5

2. kernel panic:
unable to handle kernel paging request at ffffffff88410f20 RIP: 
 [<ffffffff8001ea91>] __dentry_open+0x6e/0x1dc
PGD 203067 PUD 205063 PMD 21bd81067 PTE 0
Oops: 0000 [1] SMP 
last sysfs file: /module/kvm/version
CPU 2 
Modules linked in: nls_utf8 vfat fat nfsd exportfs auth_rpcgss tun nfs fscache nfs_acl autofs4 hidp rfcomm l2cap bluetooth lockd sunrpc cpufreq_ondemand powernow_k8 freq_table bridge ipv6 xfrm_nalgo crypto_api loop dm_multipath scsi_dh video backlight sbs power_meter hwmon i2c_ec dell_wmi wmi button battery asus_acpi acpi_memhotplug ac parport_pc lp parport floppy snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm snd_timer snd_page_alloc snd_hwdep tpm_infineon snd tpm tg3 i2c_piix4 shpchp tpm_bios i2c_core sr_mod cdrom soundcore amd64_edac_mod edac_mc serio_raw sg pcspkr dm_raid45 dm_message dm_region_hash dm_mem_cache dm_snapshot dm_zero dm_mirror dm_log dm_mod ahci libata sd_mod scsi_mod ext3 jbd uhci_hcd ohci_hcd ehci_hcd
Pid: 16409, comm: kvm_stat Tainted: G      2.6.18-209.el5 #1
RIP: 0010:[<ffffffff8001ea91>]  [<ffffffff8001ea91>] __dentry_open+0x6e/0x1dc
RSP: 0018:ffff81021a9ffe68  EFLAGS: 00010282
RAX: ffffffff88410f20 RBX: ffff81021cb03980 RCX: ffff81021cb03980
RDX: 0000000000000000 RSI: ffff8101077ae280 RDI: ffff81021f447228
RBP: ffff81021fdc1910 R08: 0000000000000000 R09: ffffff9c8c94d000
R10: ffff81021cb03980 R11: ffffffff8002c51d R12: 00000000ffffff9c
R13: 0000000000000000 R14: ffff8101077ae280 R15: ffff81021f447228
FS:  00002b0588461190(0000) GS:ffff81021fc1ce40(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffffff88410f20 CR3: 000000021d36a000 CR4: 00000000000006e0
Process kvm_stat (pid: 16409, threadinfo ffff81021a9fe000, task ffff81021d2550c0)
Stack:  00002b058c94c000 0000000000008000 0000000000008000 00000000ffffff9c
 0000000000000011 ffff81016e3bb000 0000000000000000 ffffffff800275aa
 ffff81021f447228 ffff8101077ae280 ffff81021a9ffdb8 ffff81021a9ffdb8
Call Trace:
 [<ffffffff800275aa>] do_filp_open+0x2a/0x38
 [<ffffffff80019e81>] do_sys_open+0x44/0xbe
 [<ffffffff8005d28d>] tracesys+0xd5/0xe0


Code: 48 8b 10 48 85 d2 74 1e 65 8b 04 25 2c 00 00 00 83 3a 02 74 
RIP  [<ffffffff8001ea91>] __dentry_open+0x6e/0x1dc
 RSP <ffff81021a9ffe68>
CR2: ffffffff88410f20
 <0>Kernel panic - not syncing: Fatal exception

Comment 3 Eduardo Habkost 2010-09-16 20:19:58 UTC
Found the likely cause: debugfs_remove() is called when kvm-intel (or kvm-amd) is unloaded, but the .owner field of the debugfs fops structs points to the kvm.ko module.

This way, we may end up calling debugfs_remove() while the KVM debugfs files are still open, and that shouldn't be allowed to happen.
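
As a rough illustration of this hypothesis, here is a userspace C sketch. It is not kernel code: every type and function name in it (model_module, model_fops, model_file, model_open, model_unload, file_removed_while_open) is hypothetical, and "kvm-intel" stands in for either vendor module. It only assumes that opening a file pins the module named in .owner, and that a module with a non-zero reference count cannot be unloaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for a kernel module, a fops table and a debugfs file. */
struct model_module {
    const char *name;
    int refcount;
    bool loaded;
};

struct model_fops {
    struct model_module *owner;   /* plays the role of the .owner field */
};

struct model_file {
    const struct model_fops *fops;
    bool present;                 /* false once the file has been removed */
};

/* Opening a file pins whichever module the fops table names as owner. */
static bool model_open(struct model_file *f)
{
    if (!f->present || !f->fops->owner->loaded)
        return false;
    f->fops->owner->refcount++;
    return true;
}

/* Unloading is refused while the module is pinned; otherwise the module's
 * exit path runs and removes the given file (if any). */
static bool model_unload(struct model_module *m, struct model_file *removed_on_exit)
{
    if (m->refcount > 0)
        return false;
    m->loaded = false;
    if (removed_on_exit != NULL)
        removed_on_exit->present = false;
    return true;
}

/* Returns true if the file ends up removed while a process still has it open. */
static bool file_removed_while_open(bool owner_is_vendor)
{
    struct model_module kvm = { "kvm", 0, true };
    struct model_module vendor = { "kvm-intel", 0, true };
    struct model_fops stat_fops = { owner_is_vendor ? &vendor : &kvm };
    struct model_file stat_file = { &stat_fops, true };

    bool opened = model_open(&stat_file);
    assert(opened);
    (void)opened;

    /* The vendor module's exit path is the one that removes the file. */
    (void)model_unload(&vendor, &stat_file);
    return !stat_file.present;
}
```

In this model, file_removed_while_open(false) returns true: the open pins kvm, so nothing stops the vendor module from unloading and removing the file. file_removed_while_open(true) returns false, because the open pins the module whose exit path does the removal.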

Comment 4 Eduardo Habkost 2010-09-17 21:19:56 UTC
After some analysis, I concluded that the wrong .owner field may be a problem once the file is already open, but the crash here happens before fops_get() returns inside __dentry_open(). The main issue seems to be a potential race in __dentry_open(). I will ask for feedback on rhkernel-list, as I am not sure I haven't missed something when reading the sys_open() and module unload code paths.

Rewording BZ summary to make it explicit that a module unload is necessary to reproduce it (making it less serious).

Comment 5 Eduardo Habkost 2010-09-17 22:10:59 UTC
Even without the potential race condition in __dentry_open(), the crash can be reproduced more easily, with no race involved, by just doing this on an Intel machine:

modprobe kvm
modprobe kvm-amd
rmmod kvm
cat /sys/kernel/debug/kvm/largepages

Maybe there is a race condition too, but the easier-to-reproduce case doesn't involve one, just a failure to clean up after errors in kvm_init().
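
To make "failure to clean up after errors" concrete, here is a userspace C sketch of the usual goto-style unwind pattern. It is not the actual kvm_init() code: the names (model_init_state, model_hw_setup, model_init) are hypothetical, and the ordering of steps (debugfs entries published first, hardware setup second) is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical record of what an init routine has published so far. */
struct model_init_state {
    bool debugfs_created;
};

/* Stand-in for the hardware setup step; fails when the wrong vendor
 * module is loaded (for example kvm-amd on an Intel machine). */
static int model_hw_setup(bool hw_supported)
{
    return hw_supported ? 0 : -1;
}

/* With unwind_on_error == false, a failed init leaves the debugfs
 * entries behind; with true, the error path removes them again. */
static int model_init(struct model_init_state *st, bool hw_supported,
                      bool unwind_on_error)
{
    int r;

    assert(st != NULL);

    st->debugfs_created = true;        /* step 1: publish debugfs entries */

    r = model_hw_setup(hw_supported);  /* step 2: may fail */
    if (r)
        goto out_fail;

    return 0;

out_fail:
    if (unwind_on_error)
        st->debugfs_created = false;   /* undo step 1 on the error path */
    return r;
}
```

Calling model_init() with hw_supported and unwind_on_error both false returns an error yet leaves debugfs_created set, which corresponds to stale debugfs files surviving a failed load. A later rmmod of the owning module would then leave those files pointing at freed module memory.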

@Suqin Huang: are you able to reproduce this only when using "modprobe kvm-amd" on Intel machines (or vice versa), or is it also reproducible when you load the right module (kvm-intel on an Intel machine or kvm-amd on an AMD machine)?

Comment 6 Suqin Huang 2010-09-19 03:01:40 UTC
Could not rmmod the kvm_intel module on an AMD machine.
It reproduces when I load/unload the right module.

Comment 12 Suqin Huang 2010-11-03 09:14:55 UTC
Tried 500 times.
Fixed on kvm-83-207.el5
Kernel: 2.6.18-230.el5

Comment 14 Amos Kong 2010-12-17 02:48:14 UTC
This bug was reproduced with 2.6.18-194.30.1.el5 + kvm-83-164.el5_5.30.
Do we need to clone it to 5.5.z?

Comment 17 errata-xmlrpc 2011-01-13 23:37:07 UTC
An advisory has been issued which should help with the problem
described in this bug report. This report is therefore being
closed with a resolution of ERRATA. For more information
on the solution and/or where to find the updated files,
please follow the link below. You may reopen this bug report
if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2011-0028.html
