Bug 1047892

Summary: [3.13.0-0.rc6.git0.1.fc21.x86_64+debug] Kernel stack trace: "kernel BUG at mm/page_alloc.c:2788"
Product: Fedora    Reporter: Kashyap Chamarthy <kchamart>
Component: kernel    Assignee: Marcelo Tosatti <mtosatti>
Status: CLOSED ERRATA    QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: unspecified    Priority: unspecified
Version: rawhide    CC: crobinso, gansalmon, hzguanqiang, itamar, jonathan, kchamart, kernel-maint, knoel, madhu.chinakonda, mtosatti
Hardware: Unspecified    OS: Unspecified
Fixed In Version: kernel-3.12.7-300.fc20    Doc Type: Bug Fix
Last Closed: 2014-01-14 08:34:26 UTC    Type: Bug
Attachments:
  Complete stdout of dmesg
  Complete stdout of dmidecode

Description Kashyap Chamarthy 2014-01-02 14:12:06 UTC
Description of problem
----------------------

Booting into the 3.13.0-0.rc6.git0.1.fc21.x86_64+debug kernel produces the stack trace below (the complete trace follows further down):

========
Jan  2 13:17:59 tesla kernel: [54532.561299] ------------[ cut here ]------------
Jan  2 13:17:59 tesla kernel: [54532.561342] kernel BUG at mm/page_alloc.c:2788!
Jan  2 13:17:59 tesla kernel: [54532.561357] invalid opcode: 0000 [#1] SMP 
========


Version info
-------------

  $ uname -r; rpm -q qemu-system-x86
  3.13.0-0.rc6.git0.1.fc21.x86_64+debug
  qemu-system-x86-1.7.0-3.fc21.x86_64


Stack trace from logs
---------------------

========
.
.
.
Jan  2 13:17:59 tesla kernel: [54532.561299] ------------[ cut here ]------------
Jan  2 13:17:59 tesla kernel: [54532.561342] kernel BUG at mm/page_alloc.c:2788!
Jan  2 13:17:59 tesla kernel: [54532.561357] invalid opcode: 0000 [#1] SMP 
Jan  2 13:17:59 tesla kernel: [54532.561374] Modules linked in: cdc_acm usb_storage vfat fat mmc_block vhost_net vhost macvtap macvlan xt_CHECKSUM tun bridge stp llc ebtable_nat fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle iptable_security iptable_raw nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables openvswitch vxlan ip_tunnel gre libcrc32c dm_crypt iTCO_wdt iTCO_vendor_support x86_pkg_temp_thermal coretemp kvm_intel kvm crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel microcode snd_hda_codec_hdmi snd_hda_codec_conexant uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core serio_raw videodev media arc4 btusb iwldvm bluetooth mac80211 snd_hda_intel snd_hda_codec iwlwifi snd_hwdep snd_seq sdhci_pci lpc_ich sdhci snd_seq_device i2c_i801 mfd_core mmc_core cfg80211 snd_pcm e1000e ptp pps_core snd_page_alloc mei_me shpchp snd_timer mei wmi thinkpad_acpi snd soundcore rfkill binfmt_misc uinput i915 i2c_algo_bit drm_kms_helper drm i2c_core video
Jan  2 13:17:59 tesla kernel: [54532.561783] CPU: 0 PID: 3158 Comm: qemu-system-x86 Not tainted 3.13.0-0.rc6.git0.1.fc21.x86_64+debug #1
Jan  2 13:17:59 tesla kernel: [54532.561817] Hardware name: LENOVO 4291IQ1/4291IQ1, BIOS 8DET63WW (1.33 ) 07/19/2012
Jan  2 13:17:59 tesla kernel: [54532.561840] task: ffff8801e2f325f0 ti: ffff8801df3fe000 task.ti: ffff8801df3fe000
Jan  2 13:17:59 tesla kernel: [54532.561863] RIP: 0010:[<ffffffff8118a21d>]  [<ffffffff8118a21d>] free_pages+0x6d/0x70
Jan  2 13:17:59 tesla kernel: [54532.561891] RSP: 0018:ffff8801df3ffb78  EFLAGS: 00010246
Jan  2 13:17:59 tesla kernel: [54532.561907] RAX: 0000000000000000 RBX: 6b6b6b6b6b6b6b6b RCX: 0000000000000001
Jan  2 13:17:59 tesla kernel: [54532.561929] RDX: 6b6b6b6beb6b6b6b RSI: 0000000000000000 RDI: 6b6be36b6b6b6b6b
Jan  2 13:17:59 tesla kernel: [54532.561950] RBP: ffff8801df3ffb88 R08: ffffea0002edce60 R09: 0000000000000000
Jan  2 13:17:59 tesla kernel: [54532.561971] R10: 0000000000000000 R11: 0000000000000001 R12: 0000000000000000
Jan  2 13:17:59 tesla kernel: [54532.561993] R13: ffff8801d79002c0 R14: ffff8801d79002d8 R15: ffff8801d79030c0
Jan  2 13:17:59 tesla kernel: [54532.562014] FS:  00007ffc1d663700(0000) GS:ffff880214c00000(0000) knlGS:0000000000000000
Jan  2 13:17:59 tesla kernel: [54532.562038] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jan  2 13:17:59 tesla kernel: [54532.562055] CR2: 00007fdf33f94100 CR3: 0000000001c0c000 CR4: 00000000000427e0
Jan  2 13:17:59 tesla kernel: [54532.562076] Stack:
Jan  2 13:17:59 tesla kernel: [54532.562084]  00000000df3ffb88 ffff880009390638 ffff8801df3ffba0 ffffffffa0766f66
Jan  2 13:17:59 tesla kernel: [54532.562113]  ffff8800af0d8000 ffff8801df3ffbb8 ffffffffa076dea3 ffff8800af0d8000
Jan  2 13:17:59 tesla kernel: [54532.562140]  ffff8801df3ffbd0 ffffffffa05c2700 ffff8801d7900000 ffff8801df3ffc20
Jan  2 13:17:59 tesla kernel: [54532.562167] Call Trace:
Jan  2 13:17:59 tesla kernel: [54532.562180]  [<ffffffffa0766f66>] free_loaded_vmcs+0x26/0x30 [kvm_intel]
Jan  2 13:17:59 tesla kernel: [54532.562203]  [<ffffffffa076dea3>] vmx_free_vcpu+0x33/0x70 [kvm_intel]
Jan  2 13:17:59 tesla kernel: [54532.562232]  [<ffffffffa05c2700>] kvm_arch_vcpu_free+0x50/0x60 [kvm]
Jan  2 13:17:59 tesla kernel: [54532.562259]  [<ffffffffa05c3312>] kvm_arch_destroy_vm+0x102/0x240 [kvm]
Jan  2 13:17:59 tesla kernel: [54532.562281]  [<ffffffff810ed77d>] ? synchronize_srcu+0x1d/0x20
Jan  2 13:17:59 tesla kernel: [54532.562300]  [<ffffffff811d3a8e>] ? mmu_notifier_unregister+0xee/0x130
Jan  2 13:17:59 tesla kernel: [54532.562331]  [<ffffffffa05a8be1>] kvm_put_kvm+0xe1/0x190 [kvm]
Jan  2 13:17:59 tesla kernel: [54532.562353]  [<ffffffffa05a8cc8>] kvm_vcpu_release+0x18/0x20 [kvm]
Jan  2 13:17:59 tesla kernel: [54532.562372]  [<ffffffff811ff525>] __fput+0xf5/0x2c0
Jan  2 13:17:59 tesla kernel: [54532.562388]  [<ffffffff811ff73e>] ____fput+0xe/0x10
Jan  2 13:17:59 tesla kernel: [54532.562403]  [<ffffffff8109ca04>] task_work_run+0xb4/0xe0
Jan  2 13:17:59 tesla kernel: [54532.562421]  [<ffffffff81077664>] do_exit+0x2e4/0xcf0
Jan  2 13:17:59 tesla kernel: [54532.562437]  [<ffffffff810780fc>] do_group_exit+0x4c/0xc0
Jan  2 13:17:59 tesla kernel: [54532.562454]  [<ffffffff8108ae11>] get_signal_to_deliver+0x2d1/0x930
Jan  2 13:17:59 tesla kernel: [54532.562474]  [<ffffffff81019518>] do_signal+0x48/0x610
Jan  2 13:17:59 tesla kernel: [54532.562491]  [<ffffffff811018a6>] ? do_futex+0xe6/0xd00
Jan  2 13:17:59 tesla kernel: [54532.562509]  [<ffffffff810ab2df>] ? finish_task_switch+0x3f/0x120
Jan  2 13:17:59 tesla kernel: [54532.562528]  [<ffffffff810d1c8d>] ? trace_hardirqs_on+0xd/0x10
Jan  2 13:17:59 tesla kernel: [54532.563423]  [<ffffffff81019b50>] do_notify_resume+0x70/0xa0
Jan  2 13:17:59 tesla kernel: [54532.564274]  [<ffffffff81771562>] int_signal+0x12/0x17
Jan  2 13:17:59 tesla kernel: [54532.565134] Code: 00 00 00 ea ff ff 48 01 d3 48 0f 42 05 0d 8e a8 00 48 01 c3 48 c1 eb 0c 48 c1 e3 06 48 01 df e8 2a ff ff ff 48 83 c4 08 5b 5d c3 <0f> 0b 90 66 66 66 66 90 55 48 85 ff 48 89 e5 41 55 49 89 fd 41 
Jan  2 13:17:59 tesla kernel: [54532.567034] RIP  [<ffffffff8118a21d>] free_pages+0x6d/0x70
Jan  2 13:17:59 tesla kernel: [54532.567926]  RSP <ffff8801df3ffb78>
Jan  2 13:17:59 tesla kernel: [54532.572384] ---[ end trace ae34a7cb7620f953 ]---
Jan  2 13:17:59 tesla kernel: [54532.572389] Fixing recursive fault but reboot is needed!
========


Additional info
----------------

If it matters, I built QEMU from Fedora master git:

    $ sudo yum install fedpkg  -y
    $ sudo yum-builddep qemu
    $ fedpkg clone -B -a qemu
    $ cd qemu/master
    $ git log | head -1
    commit c4896d008b4e71e0cdbc505d8ff8849f830ac531
    $ fedpkg local
    $ sudo yum localupdate x86_64/*

Comment 1 Kashyap Chamarthy 2014-01-02 14:16:27 UTC
Created attachment 844569 [details]
Complete stdout of dmesg

Comment 2 Kashyap Chamarthy 2014-01-02 14:17:08 UTC
Created attachment 844570 [details]
Complete stdout of dmidecode

Comment 3 Marcelo Tosatti 2014-01-02 17:25:10 UTC
Kashyap Chamarthy,

1) Which sort of guest was executing when this issue occurred?
2) Is it reproducible?

Comment 4 Kashyap Chamarthy 2014-01-02 18:20:19 UTC
(In reply to Marcelo Tosatti from comment #3)
> Kashyap Chamarthy,
> 
> 1) Which sort of guest was executing when this issue occurred?

Fedora 20 guest.

> 2) Is it reproducible?

Cannot say definitively. I reported this on the first occurrence. I'd have to reboot my laptop again to see if I can reproduce this.

I should add some more context: I noticed this issue when I found an un-killable defunct qemu process (I ended up with it when I tried to force power-off the guest via `virsh`), and I had to reboot the host.

  $ virsh destroy ostack-compute 
  error: Failed to destroy domain 3
  error: Failed to terminate process 3152 with SIGKILL: Device or resource busy
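A few standard checks to see where such an unkillable process is stuck (generic commands, not part of the original report; PID 3152 is the qemu process from this report, substitute your own):

```shell
# Inspect a process that refuses to die. PID 3152 is the defunct qemu
# process from this report -- substitute your own.
pid=3152
# STAT 'Z' = zombie, 'D' = uninterruptible sleep inside the kernel;
# WCHAN shows the kernel function the task is blocked in.
ps -o pid,ppid,stat,wchan:32,comm -p "$pid" || echo "no such pid: $pid"
# Kernel stack of the blocked task (needs root; empty for zombies):
cat "/proc/$pid/stack" 2>/dev/null || true
```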

NOTE - The disk image is *not* on an NFS mount; it's on a local disk:

 $ virsh domblklist ostack-compute
  Target     Source
  ------------------------------------------------
  vda        /home/kashyap/vmimages/ostack-compute.qcow2


Trying to destroy the guest again with LIBVIRT_DEBUG enabled, this is what I see in a loop:
=============
.
.
.
2014-01-02 12:42:54.012+0000: 27462: debug : virKeepAliveCheckMessage:395 : Got keepalive request from client 0x7f4dfaca75d0
2014-01-02 12:42:54.012+0000: 27462: debug : virNetMessageNew:44 : msg=0x7f4dfaca7770 tracked=0
2014-01-02 12:42:54.012+0000: 27462: debug : virNetMessageEncodePayloadEmpty:479 : Encode length as 28
2014-01-02 12:42:54.012+0000: 27462: debug : virKeepAliveMessage:101 : Sending keepalive response to client 0x7f4dfaca75d0
2014-01-02 12:42:54.012+0000: 27462: debug : virKeepAliveMessage:104 : RPC_KEEPALIVE_SEND: ka=0x7f4dfaca78c0 client=0x7f4dfaca75d0 prog=1801807216 vers=1 proc=2
2014-01-02 12:42:54.012+0000: 27462: debug : virNetClientQueueNonBlocking:1926 : RPC_CLIENT_MSG_TX_QUEUE: client=0x7f4dfaca75d0 len=28 prog=1801807216 vers=1 proc=2 type=2 status=0 serial=0
2014-01-02 12:42:54.012+0000: 27462: debug : virNetClientCallNew:1905 : New call 0x7f4dfaca7ae0: msg=0x7f4dfaca7770, expectReply=0, nonBlock=1
2014-01-02 12:42:54.012+0000: 27462: debug : virNetMessageClear:55 : msg=0x7f4dfaca7638 nfds=0
2014-01-02 12:42:54.012+0000: 27462: debug : virNetClientIOEventLoopRemoveDone:1379 : Removing completed call 0x7f4dfaca7ae0
2014-01-02 12:42:59.172+0000: 27462: debug : virNetMessageDecodeLength:149 : Got length, now need 28 total (24 more)
2014-01-02 12:42:59.172+0000: 27462: debug : virNetClientCallDispatch:1123 : RPC_CLIENT_MSG_RX: client=0x7f4dfaca75d0 len=28 prog=1801807216 vers=1 proc=1 type=2 status=0 serial=0
2014-01-02 12:42:59.172+0000: 27462: debug : virKeepAliveCheckMessage:374 : ka=0x7f4dfaca78c0, client=0x7f4dfaca75d0, msg=0x7f4dfaca7638
2014-01-02 12:42:59.172+0000: 27462: debug : virKeepAliveCheckMessage:391 : RPC_KEEPALIVE_RECEIVED: ka=0x7f4dfaca78c0 client=0x7f4dfaca75d0 prog=1801807216 vers=1 proc=1
.
.
.
=============

And:

  $ ps -ef | grep qemu
  qemu      3152     1 30  2013 ?        13:30:42 [qemu-system-x86] <defunct>

  $ pstree 3152
  qemu-system-x86───{qemu-system-x86}


Libvirt & QEMU versions on the host:

  $ rpm -q libvirt-daemon-kvm qemu-system-x86
  libvirt-daemon-kvm-1.2.0-1.fc21.x86_64
  qemu-system-x86-1.7.0-3.fc21.x86_64

Comment 5 Marcelo Tosatti 2014-01-02 20:16:32 UTC
Are you using nested virtualization?

Comment 6 Kashyap Chamarthy 2014-01-02 21:04:05 UTC
(In reply to Marcelo Tosatti from comment #5)
> Are you using nested virtualization?

Yes.

  $ modinfo kvm_intel | grep -i nested
  parm:           nested:bool
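`modinfo` only confirms that the parameter exists; its current value can be read from sysfs while the module is loaded (a quick generic check, not from the original report):

```shell
# Print the current value of the kvm_intel 'nested' parameter.
# Shows Y (or 1) when nested VMX is enabled, N (or 0) when disabled;
# the file only exists while kvm_intel is loaded.
f=/sys/module/kvm_intel/parameters/nested
if [ -r "$f" ]; then
    cat "$f"
else
    echo "kvm_intel not loaded"
fi
```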


And, just to be clear:

 - The crash was seen on reboot of the *bare-metal* host, which has nested
   KVM enabled.
 - The said guest (ostack-compute.qcow2) in comment 4 has the host CPU
   passed through to the guest, i.e. I used:

     <cpu mode='host-passthrough'>
     </cpu>

   for the guest (ostack-compute.qcow2)
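(A generic follow-up check, not from the report: with host-passthrough, the L1 guest should see the host's vmx flag, which is what makes nested KVM possible inside it.)

```shell
# Run inside the L1 guest: count CPUs advertising the vmx flag.
# A non-zero count means kvm_intel can be loaded in this guest.
n=$(grep -c -w vmx /proc/cpuinfo || true)
echo "CPUs with vmx: $n"
```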

Comment 7 Marcelo Tosatti 2014-01-02 21:59:53 UTC
(In reply to Kashyap Chamarthy from comment #6)
> (In reply to Marcelo Tosatti from comment #5)
> > Are you using nested virtualization?
> 
> Yes.
> 
>   $ modinfo kvm_intel | grep -i nested
>   parm:           nested:bool
> 
> 
> And, just to be clear:
> 
>  - The crash was seen on reboot of *bare-metal* host which has Nested
> enabled.
>  - The said guest (ostack-compute.qcow2) in comment 4 has the host CPU
>    passed through to the guest, i.e. I used:
> 
>      <cpu mode='host-passthrough'>
>      </cpu>
> 
>    for the guest (ostack-compute.qcow2)

Ok, are you actually using nested guests in the ostack-compute.qcow2 guest? TIA

Comment 8 Kashyap Chamarthy 2014-01-03 08:55:53 UTC
(In reply to Marcelo Tosatti from comment #7)
> Ok, are you actually using nested guests in the ostack-compute.qcow2 guest?

Yes. The ostack-compute.qcow2 guest is the Compute host of an OpenStack set-up and runs Nova instances (CirrOS and Fedora). If you prefer, I can post the QEMU command lines and libvirt XMLs of L1 and L2.

Comment 9 Marcelo Tosatti 2014-01-03 21:15:08 UTC
Patch posted: http://article.gmane.org/gmane.comp.emulators.kvm.devel/117837

Comment 10 Josh Boyer 2014-01-06 19:21:19 UTC
Patch applied.  Thanks all!

Comment 11 Fedora Update System 2014-01-11 16:16:50 UTC
kernel-3.12.7-300.fc20 has been submitted as an update for Fedora 20.
https://admin.fedoraproject.org/updates/kernel-3.12.7-300.fc20

Comment 12 Fedora Update System 2014-01-11 16:20:16 UTC
kernel-3.12.7-200.fc19 has been submitted as an update for Fedora 19.
https://admin.fedoraproject.org/updates/kernel-3.12.7-200.fc19

Comment 13 Fedora Update System 2014-01-12 04:59:43 UTC
Package kernel-3.12.7-200.fc19:
* should fix your issue,
* was pushed to the Fedora 19 testing repository,
* should be available at your local mirror within two days.
Update it with:
# su -c 'yum update --enablerepo=updates-testing kernel-3.12.7-200.fc19'
as soon as you are able to, then reboot.
Please go to the following url:
https://admin.fedoraproject.org/updates/FEDORA-2014-0684/kernel-3.12.7-200.fc19
then log in and leave karma (feedback).

Comment 14 Fedora Update System 2014-01-14 08:34:26 UTC
kernel-3.12.7-200.fc19 has been pushed to the Fedora 19 stable repository.  If problems still persist, please make note of it in this bug report.

Comment 15 Fedora Update System 2014-01-14 08:37:40 UTC
kernel-3.12.7-300.fc20 has been pushed to the Fedora 20 stable repository.  If problems still persist, please make note of it in this bug report.