975065 – [abrt] BUG: unable to handle kernel NULL pointer dereference at (null)

Bug 975065 - [abrt] BUG: unable to handle kernel NULL pointer dereference at (null)

Summary: [abrt] BUG: unable to handle kernel NULL pointer dereference at (null)

Keywords:
Status:	CLOSED ERRATA
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	kernel
Sub Component:
Version:	18
Hardware:	x86_64
OS:	Unspecified
Priority:	unspecified
Severity:	unspecified
Target Milestone:	---
Assignee:	Kernel Maintainer List
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:	abrt_hash:3f3bb8f8edae2f605f5e210d4d2...
Duplicates (9):	975187 975306 975615 975765 976038 976363 978785 979644 982437 (view as bug list)
Depends On:
Blocks:
TreeView+	depends on / blocked

Reported:	2013-06-17 14:33 UTC by James Boyle
Modified:	2013-08-23 21:32 UTC (History)
CC List:	27 users (show)
Fixed In Version:
Clone Of:
Environment:
Last Closed:	2013-08-23 21:32:03 UTC
Type:	---
Embargoed:
Dependent Products:

Attachments	(Terms of Use)
File: dmesg (84.77 KB, text/plain) 2013-06-17 14:33 UTC, James Boyle	no flags	Details
debugging patch 1 (1.73 KB, patch) 2013-07-04 09:57 UTC, Michael S. Tsirkin	no flags	Details \| Diff
debugging patch 2 (obsoletes 1) (1.72 KB, patch) 2013-07-04 10:01 UTC, Michael S. Tsirkin	no flags	Details \| Diff
Patch to fix the linear part len. (857 bytes, patch) 2013-07-09 09:37 UTC, jason wang	no flags	Details \| Diff
/var/log/message of failure with Freebsd 9.1 guest with virtio blk/net , running ezjail-admin update -b -p (87.15 KB, text/plain) 2013-07-17 02:53 UTC, Izhar Firdaus	no flags	Details
updated-patches-3.9.9 (3.90 KB, patch) 2013-07-18 15:53 UTC, James Boyle	no flags	Details \| Diff
Show Obsolete (1) View All

Description James Boyle 2013-06-17 14:33:35 UTC

Description of problem:
I noticed the kernel oops in the logs this morning.

Additionally, a Windows 7 VM is now unable to use the RH VirtIO virtual NIC (Other VMs' networking is functioning just fine).  I also attempted to shutdown the Windows 7 VM gracefully.  It never got past "Logging Off".  
Now, "#virsh destroy mnich-win7" fails with:
error: Failed to destroy domain mnich-win7
error: operation failed: failed to kill qemu process with SIGTERM

virt-manager now cannot connect.

I can't kill the qemu-kvm process manually either, i.e. it's a wedged zombie:
qemu      3117  8.9 26.0 11249264 8563280 ?    D    Jun14 391:02 /usr/bin/qemu-kvm -name mnich-win7 -S -M pc-1.2 -enable-kvm -m 8192 -smp 2,sockets=2,cores=1,threads=1 -uuid aba8fb50-d883-d9ce-3706-51574b8dad78 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mnich-win7.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=cd,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-ccid,id=ccid0 -drive file=/raid1/vms/mnich-win7.img,if=none,id=drive-virtio-disk0,format=qed,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/raid1/iso/virtio-win-0.1-59.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=22,id=hostnet0,vhost=on,vhostfd=23 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b7:97:fb,bus=pci.0,addr=0x3 -device ccid-card-emulated,backend=nss-emulated,id=smartcard0,bus=ccid0.0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5900,addr=127.0.0.1,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6

I tried killing the process with SIGTERM, SIGILL, SIGSEGV, and SIGKILL.  It doesn't notice at all.

I think rebooting is all I can do now.

Additional info:
reporter:       libreport-2.1.4
BUG: unable to handle kernel NULL pointer dereference at           (null)
IP: [<ffffffff81141851>] put_page+0x11/0x60
PGD 0 
Oops: 0000 [#1] SMP 
Modules linked in: lp parport tcp_lp ebtable_nat xt_CHECKSUM bridge stp llc fuse nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi iscsi_boot_sysfs bnep bluetooth bnx2i cnic uio ebtable_filter cxgb4i cxgb4 ebtables rfkill cxgb3i ip6table_filter ip6_tables cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek raid1 snd_hda_intel snd_hda_codec iTCO_wdt dcdbas snd_hwdep snd_seq snd_seq_device iTCO_vendor_support snd_pcm vhost_net snd_page_alloc tun macvtap macvlan acpi_cpufreq mperf coretemp snd_timer snd i2c_i801 soundcore e1000e ptp pps_core mei lpc_ich mfd_core usblp kvm_intel kvm microcode tpm_tis joydev serio_raw tpm tpm_bios binfmt_misc uinput dm_crypt radeon crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper ttm ghash_clmulni_intel drm i2c_core video
CPU 1 
Pid: 3118, comm: vhost-3117 Not tainted 3.9.5-201.fc18.x86_64 #1 Dell Inc. OptiPlex 9010
RIP: 0010:[<ffffffff81141851>]  [<ffffffff81141851>] put_page+0x11/0x60
RSP: 0018:ffff8804ba6e7bf8  EFLAGS: 00010292
RAX: ffff8807bd9f1cc0 RBX: 0000000000000000 RCX: ffffea000c6b801c
RDX: 0000000000000140 RSI: 0000000000000246 RDI: 0000000000000000
RBP: ffff8804ba6e7c08 R08: ffff8807a3774518 R09: 0000000000000010
R10: 0000000000000000 R11: 00007f0c40000000 R12: ffff880757f5a700
R13: ffffffffa01b3b2c R14: ffff880757f5a700 R15: ffff8807c00e8800
FS:  0000000000000000(0000) GS:ffff88081dc80000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 0000000696f6e000 CR4: 00000000001427f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vhost-3117 (pid: 3118, threadinfo ffff8804ba6e6000, task ffff880760768000)
Stack:
 ffffea000c81a540 0000000000000012 ffff8804ba6e7c28 ffffffff81545e3f
 ffff880757f5a700 ffff880757f5a700 ffff8804ba6e7c48 ffffffff81545ede
 ffffea000cacbe80 ffff8807a37743d8 ffff8804ba6e7c78 ffffffff81545f96
Call Trace:
 [<ffffffff81545e3f>] skb_release_data+0x8f/0x110
 [<ffffffff81545ede>] __kfree_skb+0x1e/0xa0
 [<ffffffff81545f96>] kfree_skb+0x36/0xa0
 [<ffffffffa01b3b2c>] tun_get_user+0x69c/0x7a0 [tun]
 [<ffffffffa01b3c87>] tun_sendmsg+0x57/0x80 [tun]
 [<ffffffffa02fad87>] handle_tx+0x287/0x680 [vhost_net]
 [<ffffffffa02fb1b5>] handle_tx_kick+0x15/0x20 [vhost_net]
 [<ffffffffa02f795d>] vhost_worker+0xed/0x190 [vhost_net]
 [<ffffffffa02f7870>] ? vhost_work_flush+0x110/0x110 [vhost_net]
 [<ffffffff81082ae0>] kthread+0xc0/0xd0
 [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_flush+0x20/0xb0
 [<ffffffff81082a20>] ? kthread_create_on_node+0x120/0x120
 [<ffffffff8166a22c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81082a20>] ? kthread_create_on_node+0x120/0x120
Code: 45 fc 65 48 01 04 25 70 02 01 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 <48> f7 07 00 c0 00 00 75 34 8b 47 1c 85 c0 74 1a f0 ff 4b 1c 0f 
RIP  [<ffffffff81141851>] put_page+0x11/0x60
 RSP <ffff8804ba6e7bf8>
CR2: 0000000000000000

Comment 1 James Boyle 2013-06-17 14:33:41 UTC

Created attachment 762043 [details]
File: dmesg

Comment 2 Andrew Jones 2013-06-18 08:41:08 UTC

*** Bug 975187 has been marked as a duplicate of this bug. ***

Comment 3 Josh Boyer 2013-06-18 11:45:45 UTC

*** Bug 975306 has been marked as a duplicate of this bug. ***

Comment 4 Andrew Jones 2013-06-19 07:54:32 UTC

*** Bug 975615 has been marked as a duplicate of this bug. ***

Comment 5 Andrew Jones 2013-06-19 08:12:57 UTC

Based on reports this appears to be a regression from 3.9.5-200 to 3.9.5-201, I'm suspicious of Fedora commit

commit ce5fae0641cc7b51f5dd43ce2c4b00572d6e5bd0
Author: Josh Boyer <jwboyer>
Date:   Tue Jun 11 09:26:41 2013 -0400

    Add two patches to fix issues with vhost_net and macvlan (rhbz 954181)

which is a backport of two patch from Jason Wang. Adding him to CC. I've already added Michael Tsirkin.

Comment 6 Josh Boyer 2013-06-19 12:34:25 UTC

Please test this scratch build when it completes:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5520279

It contains the currently queued networking patches for 3.9.y stable, which includes a number of patches Jason asked to be included.

Comment 7 Brad Hubbard 2013-06-19 22:09:40 UTC

I've got this installed. I can pretty much make this happen at will by mounting an iscsi disk exported from the f19 host on a rhel6 VM and then doing something like a find on the mount. I should be able to report back success/failure later today.

Comment 8 Brad Hubbard 2013-06-20 00:53:34 UTC

No joy.

[ 9241.659487] Modules linked in: vhost_net macvtap macvlan hidp fuse ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle tun bridge stp llc rfcomm bnep ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_conntrack nf_conntrack ip6_tables xfs libcrc32c snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm arc4 iwldvm mac80211 mperf coretemp snd_page_alloc btusb kvm_intel bluetooth snd_timer kvm iTCO_wdt iTCO_vendor_support iwlwifi uvcvideo lpc_ich cfg80211 microcode i2c_i801 thinkpad_acpi videobuf2_vmalloc sdhci_pci mei rfkill videobuf2_memops sdhci videobuf2_core videodev mmc_core e1000e ptp snd soundcore pps_core wmi media mfd_core uinput dm_crypt crc32_pclmul
[ 9241.659917]  crc32c_intel i915 ghash_clmulni_intel i2c_algo_bit drm_kms_helper firewire_ohci drm firewire_core usb_storage crc_itu_t i2c_core video
[ 9241.659979] CPU 1 
[ 9241.659991] Pid: 4293, comm: vhost-4292 Not tainted 3.9.6-301.4.fc19.x86_64 #1 LENOVO 4243BQ3/4243BQ3
[ 9241.660028] RIP: 0010:[<ffffffff8113caaa>]  [<ffffffff8113caaa>] put_page+0xa/0x60
[ 9241.660063] RSP: 0018:ffff8801187adc78  EFLAGS: 00010202
[ 9241.660084] RAX: ffff8801a2666cc0 RBX: 0000000000000012 RCX: ffff8801a2666cc0
[ 9241.660113] RDX: 0000000000000140 RSI: ffff8801a2666cc0 RDI: 0980000100fc0056
[ 9241.660140] RBP: ffff8801187adc80 R08: 0000000000000001 R09: 0000000000000010
[ 9241.660167] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880058711b00
[ 9241.660210] R13: ffffffffa070efd4 R14: 000000000000000c R15: ffff880058711b00
[ 9241.660238] FS:  0000000000000000(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
[ 9241.660269] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 9241.660291] CR2: 00007f2556f21000 CR3: 0000000153e00000 CR4: 00000000000427e0
[ 9241.660336] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 9241.660364] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 9241.660391] Process vhost-4292 (pid: 4293, threadinfo ffff8801187ac000, task ffff88015d301770)
[ 9241.660423] Stack:
[ 9241.660433]  0000000000000012 ffff8801187adca0 ffffffff8152cb87 ffff880058711b00
[ 9241.660467]  ffff880058711b00 ffff8801187adcb8 ffffffff8152cc1a ffff8801187c4800
[ 9241.660501]  ffff8801187adce0 ffffffff8152cce2 ffff8801187c4800 000000000000c162
[ 9241.660535] Call Trace:
[ 9241.660548]  [<ffffffff8152cb87>] skb_release_data+0x87/0x100
[ 9241.660572]  [<ffffffff8152cc1a>] __kfree_skb+0x1a/0xb0
[ 9241.660593]  [<ffffffff8152cce2>] kfree_skb+0x32/0x90
[ 9241.660616]  [<ffffffffa070efd4>] tun_get_user+0x724/0x810 [tun]
[ 9241.660641]  [<ffffffffa070f117>] tun_sendmsg+0x57/0x80 [tun]
[ 9241.660666]  [<ffffffffa0788a58>] handle_tx+0x1c8/0x640 [vhost_net]
[ 9241.660691]  [<ffffffffa0788f05>] handle_tx_kick+0x15/0x20 [vhost_net]
[ 9241.660718]  [<ffffffffa078581d>] vhost_worker+0xed/0x190 [vhost_net]
[ 9241.660745]  [<ffffffffa0785730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
[ 9241.660775]  [<ffffffff81080200>] kthread+0xc0/0xd0
[ 9241.660796]  [<ffffffff81080140>] ? insert_kthread_work+0x40/0x40
[ 9241.660822]  [<ffffffff8164ed2c>] ret_from_fork+0x7c/0xb0
[ 9241.660845]  [<ffffffff81080140>] ? insert_kthread_work+0x40/0x40
[ 9241.660868] Code: 00 00 01 75 f3 e9 b6 fe ff ff be 47 01 00 00 48 c7 c7 6f 51 9d 81 e8 46 02 f2 ff e9 a0 fe ff ff 90 66 66 66 66 90 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
[ 9241.661044] RIP  [<ffffffff8113caaa>] put_page+0xa/0x60
[ 9241.661067]  RSP <ffff8801187adc78>
[ 9241.671158] ---[ end trace bb59724f985e6711 ]---

Comment 9 Andrew Jones 2013-06-20 09:45:23 UTC

*** Bug 975765 has been marked as a duplicate of this bug. ***

Comment 10 jason wang 2013-06-20 10:12:31 UTC

I could not reproduce the issue, I try both win7 and rhel6 guest. Was the issue reproduced 100%? And I need the following information:

- What's the network configuration (firewall, bonding or else) in host?
- I notice that NAT were used, does the issue disappear when you use bridge?
- What does ifconfig $tap looks like after you meet this issue?

Thanks

Comment 11 Brad Hubbard 2013-06-20 10:30:41 UTC

virbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.122.1  netmask 255.255.255.0  broadcast 192.168.122.255
        ether 52:54:00:1e:b9:2a  txqueuelen 0  (Ethernet)
        RX packets 21587  bytes 1721347 (1.6 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 23259  bytes 25956588 (24.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vnet0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::fc54:ff:feea:2884  prefixlen 64  scopeid 0x20<link>
        ether fe:54:00:ea:28:84  txqueuelen 500  (Ethernet)
        RX packets 8191  bytes 746109 (728.6 KiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 14252  bytes 6931280 (6.6 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vnet1: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::fc54:ff:fec7:b1b  prefixlen 64  scopeid 0x20<link>
        ether fe:54:00:c7:0b:1b  txqueuelen 500  (Ethernet)
        RX packets 13396  bytes 1277456 (1.2 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 22550  bytes 19703863 (18.7 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

wlp3s0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.1.110  netmask 255.255.255.0  broadcast 192.168.1.255
        inet6 fe80::2677:3ff:fe7a:6da0  prefixlen 64  scopeid 0x20<link>
        ether 24:77:03:7a:6d:a0  txqueuelen 1000  (Ethernet)
        RX packets 1072277  bytes 1206021906 (1.1 GiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 959151  bytes 141807014 (135.2 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

tun0 as well but I will not post those details and I don't think it relevant.

$ sudo iptables -L -n -t nat
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         

Chain INPUT (policy ACCEPT)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination         
MASQUERADE  tcp  --  192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
MASQUERADE  udp  --  192.168.122.0/24    !192.168.122.0/24     masq ports: 1024-65535
MASQUERADE  all  --  192.168.122.0/24    !192.168.122.0/24 

$ iptables -L -n
iptables v1.4.18: can't initialize iptables table `filter': Permission denied (you must be root)
Perhaps iptables or your kernel needs to be upgraded.
[brad@dhcp-1-109 14-05-2013]$ sudo iptables -L -n
[sudo] password for brad: 
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:53
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:53
ACCEPT     udp  --  0.0.0.0/0            0.0.0.0/0            udp dpt:67
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0            tcp dpt:67

Chain FORWARD (policy ACCEPT)
target     prot opt source               destination         
ACCEPT     all  --  0.0.0.0/0            192.168.122.0/24     ctstate RELATED,ESTABLISHED
ACCEPT     all  --  192.168.122.0/24     0.0.0.0/0           
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-port-unreachable
REJECT     all  --  0.0.0.0/0            0.0.0.0/0            reject-with icmp-port-unreachable

$ sudo tgtadm --lld iscsi --op show --mode target
Target 1: dhcp-1-109.bne.xxx.com:server.target1
    System information:
        Driver: iscsi
        State: ready
    I_T nexus information:
        I_T nexus: 1
            Initiator: iqn.xxxx-05.com.xxx:c42dff198233 alias: rhel6
            Connection: 0
                IP Address: 192.168.122.118
        I_T nexus: 2
            Initiator: iqn.xxxx-05.com.xxx:196d64165ef alias: localhost.localdomain
            Connection: 0
                IP Address: 192.168.122.65
    LUN information:
        LUN: 0
            Type: controller
            SCSI ID: IET     00010000
            SCSI SN: beaf10
            Size: 0 MB, Block size: 1
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Thin-provisioning: No
            Backing store type: null
            Backing store path: None
            Backing store flags: 
        LUN: 1
            Type: disk
            SCSI ID: IET     00010001
            SCSI SN: beaf11
            Size: 26844 MB, Block size: 512
            Online: Yes
            Removable media: No
            Prevent removal: No
            Readonly: No
            Thin-provisioning: No
            Backing store type: rdwr
            Backing store path: /dev/mapper/vg-scratch
            Backing store flags: 
    Account information:
    ACL information:
        ALL

I'll gather the ifconfig next time I see the issue and will consider testing with bridging when I have available time. I'm also working on gathering a vmcore but kdump doesn't like the fact my /home partition is encrypted apparently.

Comment 12 jason wang 2013-06-20 11:27:52 UTC

Nothing special, thanks for the information.

Please try the following build when it was finished (which contains some debug logs before crash which can help in analysing the root cause) and collect the console output.

Thanks

http://koji.fedoraproject.org/koji/taskinfo?taskID=5523650

Comment 13 Michael S. Tsirkin 2013-06-20 12:00:32 UTC

unable to reproduce, but went over related code again,
and found a bug. It could be this:
http://patchwork.ozlabs.org/patch/252885/

Comment 14 Josh Boyer 2013-06-20 12:46:48 UTC

This scratch build contains the same changes the previous one did, plus the new patch Michael pointed to in comment #13

http://koji.fedoraproject.org/koji/taskinfo?taskID=5523794

Comment 15 Josh Boyer 2013-06-20 12:52:06 UTC

*** Bug 976363 has been marked as a duplicate of this bug. ***

Comment 16 James Boyle 2013-06-20 18:41:46 UTC

(In reply to jason wang from comment #10)
> I could not reproduce the issue, I try both win7 and rhel6 guest. Was the
> issue reproduced 100%? And I need the following information:
> 
> - What's the network configuration (firewall, bonding or else) in host?
> - I notice that NAT were used, does the issue disappear when you use bridge?
> - What does ifconfig $tap looks like after you meet this issue?
> 
> Thanks

In my case, no, it is not a recurring issue.  After I rebooted, I have not experienced the problem again (so far).  I have left my Win7 VM running (along with several other VMs: Solaris 10, OpenBSD 5.3 (x6) & RHEL{5,6}-{32,64}) as I had done before the issue occurred.  

Given that the issue happened around 17:01 on a Sunday, I'm wondering if there's a scheduled job either on the F18 system or the Win7 VM (more likely) that triggered the problem.  

Also, I found it interesting that the virtual network worked with all of the other VMs (BSD/Solaris/RHEL), even when it failed on the Windows VM.  I was able to SSH to those other VMs via the virtual network and shut down those VMs/OSes cleanly.

Comment 17 Josh Boyer 2013-06-21 12:40:26 UTC

*** Bug 976038 has been marked as a duplicate of this bug. ***

Comment 18 Brad Hubbard 2013-06-23 06:06:56 UTC

With 3.9.5-201.fc20

[  408.330642] 
 skb ffff880146adf200frag 0 addr ffffea000436fb00 compound 32768
[  408.330647] frag 1 addr ffffea000436f740 compound 32768
[  408.330649] frag 2 addr ffffea000480c5c0 compound 32768
[  408.330650] frag 3 addr ffffea00043697c0 compound 32768
[  408.330651] frag 4 addr ffffea000436f840 compound 32768
[  408.330652] frag 5 addr ffffea000480c5c0 compound 32768
[  408.330653] frag 6 addr ffffea0004224b40 compound 32768
[  408.330654] frag 7 addr ffffea0004224b80 compound 32768
[  408.330655] frag 8 addr ffffea000480c5c0 compound 32768
[  408.330656] frag 9 addr ffffea0004223dc0 compound 32768
[  408.330658] frag 10 addr ffffea0004223d80 compound 32768
[  408.330659] frag 11 addr ffffea000480c5c0 compound 32768
[  408.330660] frag 12 addr ffffea00044aa800 compound 32768
[  408.330661] frag 13 addr ffffea00044aa7c0 compound 32768
[  408.330662] frag 14 addr ffffea000480c5c0 compound 32768
[  408.330663] frag 15 addr ffffea00044aedc0 compound 32768
[  408.330664] frag 16 addr ffffea00044af0c0 compound 32768
[  408.330684] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  408.330725] IP: [<ffffffffa07424a5>] tun_audit_skb+0x65/0xa0 [tun]
[  408.330757] PGD 0 
[  408.330769] Oops: 0000 [#1] SMP 
[  408.330789] Modules linked in: vhost_net macvtap macvlan hidp ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_CHECKSUM fuse iptable_mangle tun bridge stp llc rfcomm bnep ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_conntrack nf_conntrack ip6_tables xfs libcrc32c snd_hda_codec_hdmi snd_hda_codec_conexant acpi_cpufreq mperf snd_usb_audio coretemp kvm_intel snd_usbmidi_lib snd_rawmidi snd_hda_intel snd_hda_codec snd_hwdep snd_seq arc4 iwldvm iTCO_wdt mac80211 uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_core kvm snd_seq_device iTCO_vendor_support btusb videodev bluetooth e1000e snd_pcm iwlwifi media microcode snd_page_alloc snd_timer sdhci_pci sdhci lpc_ich i2c_i801 ptp pps_core wmi cfg80211 mmc_core mei thinkpad_acpi
[  408.331197]  snd mfd_core rfkill soundcore uinput dm_crypt i915 crc32_pclmul crc32c_intel ghash_clmulni_intel i2c_algo_bit drm_kms_helper firewire_ohci firewire_core drm crc_itu_t usb_storage i2c_core video
[  408.331298] CPU 1 
[  408.331311] Pid: 3863, comm: vhost-3857 Not tainted 3.9.5-201.fc20.x86_64 #1 LENOVO 4243BQ3/4243BQ3
[  408.331351] RIP: 0010:[<ffffffffa07424a5>]  [<ffffffffa07424a5>] tun_audit_skb+0x65/0xa0 [tun]
[  408.331393] RSP: 0018:ffff880146a47cd8  EFLAGS: 00010246
[  408.331418] RAX: 0000000000000000 RBX: 0000000000000012 RCX: 0000000000000006
[  408.331451] RDX: 0000000000000000 RSI: 0000000000000011 RDI: ffffffffa07460b0
[  408.331483] RBP: ffff880146a47ce8 R08: 3320646e756f706d R09: 0000000000000413
[  408.331515] R10: 3434303030616566 R11: 6f63203063306661 R12: ffff880146adf200
[  408.331547] R13: 000000000000000c R14: ffff880146adf200 R15: ffff880146880800
[  408.331579] FS:  0000000000000000(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
[  408.331615] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  408.331642] CR2: 0000000000000000 CR3: 000000020fb3b000 CR4: 00000000000427e0
[  408.331673] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  408.331705] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  408.331738] Process vhost-3857 (pid: 3863, threadinfo ffff880146a46000, task ffff880151254650)
[  408.331776] Stack:
[  408.331786]  000000000000c162 000000000000c16e ffff880146a47d70 ffffffffa074403c
[  408.331825]  ffff880146adf200 01ffffffa07bd171 0000000000001042 ffff880146a5ddc0
[  408.331863]  0000000000000014 ffff880199a71000 ffff8801ffffffa6 00ff880100000001
[  408.331901] Call Trace:
[  408.331918]  [<ffffffffa074403c>] tun_get_user+0x6ec/0x7e0 [tun]
[  408.331947]  [<ffffffffa0744187>] tun_sendmsg+0x57/0x80 [tun]
[  408.331975]  [<ffffffffa07bea58>] handle_tx+0x1c8/0x640 [vhost_net]
[  408.332006]  [<ffffffffa07bef05>] handle_tx_kick+0x15/0x20 [vhost_net]
[  408.332036]  [<ffffffffa07bb81d>] vhost_worker+0xed/0x190 [vhost_net]
[  408.332067]  [<ffffffffa07bb730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
[  408.332102]  [<ffffffff8107fcc0>] kthread+0xc0/0xd0
[  408.332126]  [<ffffffff8107fc00>] ? insert_kthread_work+0x40/0x40
[  408.332156]  [<ffffffff8164f22c>] ret_from_fork+0x7c/0xb0
[  408.332183]  [<ffffffff8107fc00>] ? insert_kthread_work+0x40/0x40
[  408.332210] Code: 00 00 00 80 38 00 74 47 66 0f 1f 44 00 00 48 63 d3 89 de 48 c7 c7 b0 60 74 a0 48 83 c2 03 83 c3 01 48 c1 e2 04 48 8b 14 10 31 c0 <8b> 0a 81 e1 00 c0 00 00 e8 96 a6 ef e0 41 8b 84 24 d8 00 00 00 
[  408.332376] RIP  [<ffffffffa07424a5>] tun_audit_skb+0x65/0xa0 [tun]
[  408.332407]  RSP <ffff880146a47cd8>
[  408.332424] CR2: 0000000000000000
[  408.342024] ---[ end trace 4be7caf4081c8d7f ]---

Comment 19 Brad Hubbard 2013-06-23 06:24:58 UTC

With 3.9.6-301.6.fc19

[  336.562144] Modules linked in: vhost_net macvtap macvlan hidp ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle fuse tun bridge stp llc rfcomm bnep ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables xt_conntrack nf_conntrack xfs libcrc32c snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_hda_codec snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_hwdep arc4 iwldvm snd_seq mac80211 mperf snd_seq_device coretemp kvm_intel iwlwifi snd_pcm uvcvideo kvm iTCO_wdt iTCO_vendor_support videobuf2_vmalloc btusb cfg80211 mei videobuf2_memops videobuf2_core videodev snd_page_alloc snd_timer microcode sdhci_pci lpc_ich media sdhci mmc_core thinkpad_acpi bluetooth i2c_i801 e1000e mfd_core snd ptp wmi pps_core
[  336.562477]  rfkill soundcore uinput dm_crypt crc32_pclmul crc32c_intel ghash_clmulni_intel i915 firewire_ohci firewire_core crc_itu_t i2c_algo_bit usb_storage drm_kms_helper drm i2c_core video
[  336.562542] CPU 0 
[  336.562551] Pid: 3892, comm: vhost-3890 Not tainted 3.9.6-301.6.fc19.x86_64 #1 LENOVO 4243BQ3/4243BQ3
[  336.562582] RIP: 0010:[<ffffffff8113caaa>]  [<ffffffff8113caaa>] put_page+0xa/0x60
[  336.562610] RSP: 0018:ffff8801b0711c78  EFLAGS: 00010202
[  336.562626] RAX: ffff8801a537cec0 RBX: 0000000000000012 RCX: 0000000000000003
[  336.562647] RDX: 0000000000000140 RSI: 0000000000000246 RDI: 0000000000000002
[  336.562667] RBP: ffff8801b0711c80 R08: 0000000000000001 R09: 0000000000000010
[  336.562688] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88011d935900
[  336.562708] R13: ffffffffa074afd4 R14: 000000000000000c R15: ffff88011d935900
[  336.562729] FS:  0000000000000000(0000) GS:ffff88021e200000(0000) knlGS:0000000000000000
[  336.562752] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  336.562768] CR2: 0000000000000002 CR3: 0000000150821000 CR4: 00000000000427e0
[  336.562789] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  336.562809] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  336.562829] Process vhost-3890 (pid: 3892, threadinfo ffff8801b0710000, task ffff88015083ddc0)
[  336.562853] Stack:
[  336.562861]  0000000000000012 ffff8801b0711ca0 ffffffff8152cb87 ffff88011d935900
[  336.562886]  ffff88011d935900 ffff8801b0711cb8 ffffffff8152cc1a ffff88015089c800
[  336.562912]  ffff8801b0711ce0 ffffffff8152cce2 ffff88015089c800 000000000000c162
[  336.562937] Call Trace:
[  336.562948]  [<ffffffff8152cb87>] skb_release_data+0x87/0x100
[  336.562965]  [<ffffffff8152cc1a>] __kfree_skb+0x1a/0xb0
[  336.562981]  [<ffffffff8152cce2>] kfree_skb+0x32/0x90
[  336.562999]  [<ffffffffa074afd4>] tun_get_user+0x724/0x810 [tun]
[  336.563018]  [<ffffffffa074b117>] tun_sendmsg+0x57/0x80 [tun]
[  336.563036]  [<ffffffffa07bda78>] handle_tx+0x1c8/0x640 [vhost_net]
[  336.563056]  [<ffffffffa07bdf25>] handle_tx_kick+0x15/0x20 [vhost_net]
[  336.563076]  [<ffffffffa07ba81d>] vhost_worker+0xed/0x190 [vhost_net]
[  336.563096]  [<ffffffffa07ba730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
[  336.563119]  [<ffffffff81080200>] kthread+0xc0/0xd0
[  336.563134]  [<ffffffff81080140>] ? insert_kthread_work+0x40/0x40
[  336.563153]  [<ffffffff8164ed2c>] ret_from_fork+0x7c/0xb0
[  336.563170]  [<ffffffff81080140>] ? insert_kthread_work+0x40/0x40
[  336.563188] Code: 00 00 01 75 f3 e9 b6 fe ff ff be 47 01 00 00 48 c7 c7 6f 51 9d 81 e8 46 02 f2 ff e9 a0 fe ff ff 90 66 66 66 66 90 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
[  336.563322] RIP  [<ffffffff8113caaa>] put_page+0xa/0x60
[  336.563340]  RSP <ffff8801b0711c78>
[  336.563350] CR2: 0000000000000002
[  336.571304] ---[ end trace 81ec45b5866408e5 ]---

I notice these in the dmesg output and they seem to "roughly" (I'm afraid I don't have a hell of a lot of time for this at the moment, even on the weekends!) coincide with the period this has been happening.

[  141.324100] virbr0: topology change detected, propagating
[  141.324107] virbr0: port 2(vnet0) entered forwarding state
[  141.324122] virbr0: port 2(vnet0) entered forwarding state
[  141.554272] qemu-system-x86: sending ioctl 5326 to a partition!
[  141.554286] qemu-system-x86: sending ioctl 80200204 to a partition!
[  143.860809] device vnet1 entered promiscuous mode
[  143.865931] virbr0: topology change detected, propagating
[  143.865937] virbr0: port 3(vnet1) entered forwarding state
[  143.865946] virbr0: port 3(vnet1) entered forwarding state
[  143.991838] qemu-system-x86: sending ioctl 5326 to a partition!
[  143.991850] qemu-system-x86: sending ioctl 80200204 to a partition!

Comment 20 Michael S. Tsirkin 2013-06-23 10:20:34 UTC

Very interesting.
Can this be triggered on debug build?
We clearly hit some errors on data path, on debug build
there will be some info printks that tell us what's
causing this.

Comment 21 Brad Hubbard 2013-06-23 10:26:12 UTC

OK, I'll try to install the latest debug kernel tomorrow.

Comment 22 James Boyle 2013-06-24 21:12:32 UTC

So, it turns out I can reproduce the problem by kicking off a Windows backup in the guest Windows 7 VM.  The symptoms are the same as before.  The backup destination is a locally hosted SMB (samba) share on the virtual network.

I have installed the debug kernel, but I'm not sure exactly what else I can/should gather.  Give me some instructions/commands to run and I'll post the info.

--James

# uname -a
Linux mnich.quotient-inc.com 3.9.6-200.fc18.x86_64.debug #1 SMP Thu Jun 13 18:47:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux


[ 2990.175623] general protection fault: 0000 [#1] SMP 
[ 2990.176239] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i bnep cxgb3 ebtable_filter ebtables bluetooth mdio libcxgbi ip6table_filter ib_iser ip6_tables rfkill rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel vhost_net tun macvtap snd_hda_codec snd_hwdep acpi_cpufreq iTCO_wdt snd_seq snd_seq_device macvlan snd_pcm raid1 mperf coretemp e1000e tpm_tis iTCO_vendor_support joydev dcdbas tpm i2c_i801
[ 2990.178474]  tpm_bios snd_page_alloc snd_timer mei ptp pps_core lpc_ich mfd_core usblp microcode serio_raw snd kvm_intel soundcore kvm uinput binfmt_misc dm_crypt crc32_pclmul crc32c_intel ghash_clmulni_intel radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core video
[ 2990.180050] CPU 3 
[ 2990.180058] Pid: 6011, comm: vhost-6009 Not tainted 3.9.6-200.fc18.x86_64.debug #1 Dell Inc. OptiPlex 9010/0M9KCM
[ 2990.181628] RIP: 0010:[<ffffffff81175d71>]  [<ffffffff81175d71>] put_page+0x11/0x60
[ 2990.182433] RSP: 0018:ffff8807b1e69b58  EFLAGS: 00010292
[ 2990.183235] RAX: ffff8807afb37bd0 RBX: cccccccccccccccc RCX: 0000000000000006
[ 2990.184043] RDX: 0000000000000140 RSI: ffff88079def8700 RDI: cccccccccccccccc
[ 2990.184855] RBP: ffff8807b1e69b68 R08: 0000000000000000 R09: 0000000000000000
[ 2990.185674] R10: 0000000000000001 R11: 0000000000000000 R12: ffff880784a94b40
[ 2990.186496] R13: ffffffffa035cb54 R14: 000000000000f57e R15: ffff880784a94b40
[ 2990.187320] FS:  0000000000000000(0000) GS:ffff8807fd400000(0000) knlGS:0000000000000000
[ 2990.188154] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2990.188993] CR2: 000000000a450000 CR3: 000000079fe2d000 CR4: 00000000001427e0
[ 2990.189838] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2990.190685] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2990.191539] Process vhost-6009 (pid: 6011, threadinfo ffff8807b1e68000, task ffff88079def8000)
[ 2990.192446] Stack:
[ 2990.193303]  ffffea0015929040 0000000000000012 ffff8807b1e69b88 ffffffff815dcf8f
[ 2990.194169]  ffff880784a94b40 ffff880784a94b40 ffff8807b1e69ba8 ffffffff815dd35e
[ 2990.195050]  ffffea0015903240 ffff8807a2bccb00 ffff8807b1e69bd8 ffffffff815dc9da
[ 2990.195929] Call Trace:
[ 2990.196799]  [<ffffffff815dcf8f>] skb_release_data+0x8f/0x110
[ 2990.197690]  [<ffffffff815dd35e>] __kfree_skb+0x1e/0xa0
[ 2990.198579]  [<ffffffff815dc9da>] kfree_skb+0x3a/0x170
[ 2990.199463]  [<ffffffffa035cb54>] tun_get_user+0x724/0x860 [tun]
[ 2990.200343]  [<ffffffffa035cced>] tun_sendmsg+0x5d/0x90 [tun]
[ 2990.201223]  [<ffffffffa0364cb4>] handle_tx+0x294/0x6f0 [vhost_net]
[ 2990.202107]  [<ffffffffa0365145>] handle_tx_kick+0x15/0x20 [vhost_net]
[ 2990.203121]  [<ffffffffa0361ad8>] vhost_worker+0xf8/0x1a0 [vhost_net]
[ 2990.204091]  [<ffffffffa03619e0>] ? translate_desc.isra.8+0x270/0x270 [vhost_net]
[ 2990.205032]  [<ffffffff8109626d>] kthread+0xed/0x100
[ 2990.205972]  [<ffffffff81096180>] ? kthread_create_on_node+0x160/0x160
[ 2990.206916]  [<ffffffff8173f02c>] ret_from_fork+0x7c/0xb0
[ 2990.207860]  [<ffffffff81096180>] ? kthread_create_on_node+0x160/0x160
[ 2990.208805] Code: 45 fc 65 48 01 04 25 f0 06 1d 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 <48> f7 07 00 c0 00 00 75 34 8b 47 1c 85 c0 74 1a f0 ff 4b 1c 0f 
[ 2990.209925] RIP  [<ffffffff81175d71>] put_page+0x11/0x60
[ 2990.210962]  RSP <ffff8807b1e69b58>
[ 2990.218669] ---[ end trace da97fde49d42ce05 ]---
[ 3159.494278] virbr0: port 2(vnet0) entered disabled state
[ 3159.496652] device vnet0 left promiscuous mode
[ 3159.496671] virbr0: port 2(vnet0) entered disabled state
[ 3195.260761] virbr0: port 3(vnet1) entered disabled state
[ 3195.261617] device vnet1 left promiscuous mode
[ 3195.261636] virbr0: port 3(vnet1) entered disabled state



Jun 24 16:17:45 mnich libvirtd[1477]: 2013-06-24 20:17:45.261+0000: 1477: error : qemuMonitorIO:607 : internal error End of file from monitor
Jun 24 16:35:48 mnich libvirtd[1477]: 2013-06-24 20:35:48.053+0000: 1524: warning : qemuProcessKill:3903 : Timed out waiting after SIGTERM to process 6009, sending SIGKILL
Jun 24 16:35:53 mnich libvirtd[1477]: 2013-06-24 20:35:53.058+0000: 1524: warning : qemuProcessKill:3935 : Timed out waiting after SIGKILL to process 6009
Jun 24 16:35:53 mnich libvirtd[1477]: 2013-06-24 20:35:53.058+0000: 1524: error : qemuDomainDestroyFlags:1967 : operation failed: failed to kill qemu process with SIGTERM
Jun 24 16:36:04 mnich libvirtd[1477]: 2013-06-24 20:36:04.424+0000: 1477: error : virNetSocketReadWire:1185 : End of file while reading data: Input/output error

Comment 23 Brad Hubbard 2013-06-24 23:46:36 UTC

Debug kernel hasn't really given us any additional information AFAICT although, the fact the debug kernel is not detecting any memory hijinks may tell us something in itself. I'm including my output although with James' comment mine is redundant. We really need a vmcore I think but, with my encrypted /home partition this is problematic for me and probably involved me mounting a new, temporary /home whilst I generate the core. I'll try and get this done anyway.

Is it the influence of the debug kernel that changes it from a null pointer dereference to a GPF?

[  704.771850] Modules linked in: vhost_net macvtap macvlan hidp ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle fuse tun bridge stp llc rfcomm bnep ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter xt_conntrack nf_conntrack ip6_tables xfs libcrc32c snd_usb_audio snd_hda_codec_hdmi snd_hda_codec_conexant snd_hda_intel snd_usbmidi_lib snd_hda_codec snd_rawmidi mperf snd_hwdep coretemp arc4 iwldvm snd_seq mac80211 uvcvideo kvm_intel kvm snd_seq_device snd_pcm videobuf2_vmalloc videobuf2_memops videobuf2_core snd_page_alloc thinkpad_acpi iTCO_wdt iTCO_vendor_support lpc_ich videodev snd_timer mei iwlwifi media cfg80211 snd i2c_i801 btusb microcode mfd_core e1000e sdhci_pci sdhci mmc_core bluetooth ptp soundcore pps_core
[  704.772158]  rfkill wmi uinput dm_crypt crc32_pclmul crc32c_intel ghash_clmulni_intel i915 i2c_algo_bit drm_kms_helper firewire_ohci firewire_core usb_storage crc_itu_t drm i2c_core video
[  704.772229] CPU 1 
[  704.772245] Pid: 4307, comm: vhost-4306 Not tainted 3.9.6-301.fc19.x86_64.debug #1 LENOVO 4243BQ3/4243BQ3
[  704.772276] RIP: 0010:[<ffffffff81171bea>]  [<ffffffff81171bea>] put_page+0xa/0x60
[  704.772304] RSP: 0018:ffff8801143bfbd8  EFLAGS: 00010202
[  704.772321] RAX: ffff8801d1718de0 RBX: 0000000000000012 RCX: 0000000000000006
[  704.772341] RDX: 0000000000000140 RSI: ffff88010c152bb0 RDI: cccccccccccccccc
[  704.772362] RBP: ffff8801143bfbe0 R08: 0000000000000001 R09: 0000000000000000
[  704.772383] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8802113d9d40
[  704.772403] R13: ffffffffa07e9a84 R14: 000000000000000c R15: ffff8802113d9d40
[  704.772428] FS:  0000000000000000(0000) GS:ffff880214e00000(0000) knlGS:0000000000000000
[  704.772452] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  704.772469] CR2: 00007fabf03a6000 CR3: 000000010db57000 CR4: 00000000000427e0
[  704.772490] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  704.772511] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  704.772532] Process vhost-4306 (pid: 4307, threadinfo ffff8801143be000, task ffff88010c1524b0)
[  704.772556] Stack:
[  704.772563]  0000000000000012 ffff8801143bfc00 ffffffff815c3067 ffff8802113d9d40
[  704.772590]  ffff8802113d9d40 ffff8801143bfc18 ffffffff815c30fa ffff8801143b4b00
[  704.772619]  ffff8801143bfc40 ffffffff815c31c6 ffff8801143b4b00 000000000000c162
[  704.772645] Call Trace:
[  704.772657]  [<ffffffff815c3067>] skb_release_data+0x87/0x100
[  704.772675]  [<ffffffff815c30fa>] __kfree_skb+0x1a/0xb0
[  704.772691]  [<ffffffff815c31c6>] kfree_skb+0x36/0x160
[  704.772710]  [<ffffffffa07e9a84>] tun_get_user+0x744/0x880 [tun]
[  704.772730]  [<ffffffffa07e9c14>] tun_sendmsg+0x54/0x80 [tun]
[  704.772749]  [<ffffffffa085ead7>] handle_tx+0x1f7/0x6d0 [vhost_net]
[  704.772769]  [<ffffffffa085efe5>] handle_tx_kick+0x15/0x20 [vhost_net]
[  704.772789]  [<ffffffffa085a560>] vhost_worker+0xf0/0x190 [vhost_net]
[  704.772810]  [<ffffffffa085a470>] ? vhost_zerocopy_done_signal+0x20/0x20 [vhost_net]
[  704.772835]  [<ffffffff8109482d>] kthread+0xed/0x100
[  704.772852]  [<ffffffff81094740>] ? insert_kthread_work+0x80/0x80
[  704.772871]  [<ffffffff817241ec>] ret_from_fork+0x7c/0xb0
[  704.772888]  [<ffffffff81094740>] ? insert_kthread_work+0x80/0x80
[  704.772906] Code: e0 ff ff 01 e9 a8 fe ff ff be 47 01 00 00 48 c7 c7 ac f1 9e 81 e8 f7 73 ef ff e9 76 fe ff ff 66 90 66 66 66 66 90 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
[  704.773047] RIP  [<ffffffff81171bea>] put_page+0xa/0x60
[  704.773065]  RSP <ffff8801143bfbd8>
[  704.780604] ---[ end trace 400da5f479419cd4 ]---

Comment 24 Brad Hubbard 2013-06-25 11:00:52 UTC

Ahhh, but I tell a lie. It has told us something. In both cases %rdi looks to be a poisoned value so this is likely a use after free of a page pointer. I guess that probably also explains the NPD vs. GPF.

Comment 25 Michael S. Tsirkin 2013-06-25 14:25:18 UTC

there should be a KERN_INFO (level 6) message that should
be displayed before the failure.
Guessing that is masked now.
Can you please enable level 6 messages so we can see if it's there?

Comment 26 Michael S. Tsirkin 2013-06-25 15:23:29 UTC

these two patches should treat the crash on error observed here.

http://patchwork.ozlabs.org/patch/253566/
http://patchwork.ozlabs.org/patch/253563/

but it's still not clear to me what's the source of the error.

Comment 27 James Boyle 2013-06-25 17:48:23 UTC

(In reply to Michael S. Tsirkin from comment #25)
> there should be a KERN_INFO (level 6) message that should
> be displayed before the failure.
> Guessing that is masked now.
> Can you please enable level 6 messages so we can see if it's there?

(In reply to Michael S. Tsirkin from comment #26)
> 
> but it's still not clear to me what's the source of the error.

# uname -a
Linux mnich.quotient-inc.com 3.9.6-200.fc18.x86_64.debug #1 SMP Thu Jun 13 18:47:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

# sysctl -a |grep printk
kernel.printk = 4	4	1	7
# sysctl -w kernel.printk=7
kernel.printk = 7
# sysctl -a |grep printk
kernel.printk = 7	4	1	7

I did the above in hope that it will show the level 6 messages too... (please correct me if need to do something else)

Jun 25 12:28:34 mnich kernel: [  991.278430] device vnet0 left promiscuous mode
Jun 25 12:28:34 mnich kernel: [  991.278465] virbr0: port 2(vnet0) entered disabled state
Jun 25 12:31:20 mnich kernel: [ 1157.192303] virbr0: port 3(vnet1) entered disabled state
Jun 25 12:31:20 mnich kernel: [ 1157.193792] device vnet1 left promiscuous mode
Jun 25 12:31:20 mnich kernel: [ 1157.193844] virbr0: port 3(vnet1) entered disabled state
Jun 25 12:46:25 mnich kernel: [ 2062.829758] general protection fault: 0000 [#1] SMP 
Jun 25 12:46:25 mnich kernel: [ 2062.830373] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle be2iscsi iscsi_boot_sysfs nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bnx2i cnic uio ebtable_filter bnep cxgb4i cxgb4 ebtables bluetooth cxgb3i cxgb3 ip6table_filter rfkill mdio ip6_tables libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep acpi_cpufreq snd_seq mperf snd_seq_device raid1 snd_pcm snd_page_alloc mei snd_timer i2c_i801 iTCO_wdt iTCO_vendor_support coretemp lpc_ich e1000e snd dcdbas ptp pps_core serio_raw mfd_core usblp microcode joydev soundcore tpm_tis tpm tpm_bios vhost_net tun macvtap macvlan kvm_intel kvm uinput binfmt_misc dm_crypt crc32_pclmul crc32c_intel ghash_clmulni_intel radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core video
Jun 25 12:46:25 mnich kernel: [ 2062.834157] CPU 1 
Jun 25 12:46:25 mnich kernel: [ 2062.834165] Pid: 4151, comm: vhost-4150 Not tainted 3.9.6-200.fc18.x86_64.debug #1 Dell Inc. OptiPlex 9010/0M9KCM
Jun 25 12:46:25 mnich kernel: [ 2062.835705] RIP: 0010:[<ffffffff81175d71>]  [<ffffffff81175d71>] put_page+0x11/0x60
Jun 25 12:46:25 mnich kernel: [ 2062.836481] RSP: 0018:ffff88078594fb58  EFLAGS: 00010292
Jun 25 12:46:25 mnich kernel: [ 2062.837270] RAX: ffff8807aac21470 RBX: cccccccccccccccc RCX: 0000000000000006
Jun 25 12:46:25 mnich kernel: [ 2062.838068] RDX: 0000000000000140 RSI: ffff8807805d8700 RDI: cccccccccccccccc
Jun 25 12:46:25 mnich kernel: [ 2062.838863] RBP: ffff88078594fb68 R08: 0000000000000000 R09: 0000000000000000
Jun 25 12:46:25 mnich kernel: [ 2062.839701] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8807aa3ba640
Jun 25 12:46:25 mnich kernel: [ 2062.840502] R13: ffffffffa016db54 R14: 000000000000f57e R15: ffff8807aa3ba640
Jun 25 12:46:25 mnich kernel: [ 2062.841311] FS:  0000000000000000(0000) GS:ffff8807fd000000(0000) knlGS:0000000000000000
Jun 25 12:46:25 mnich kernel: [ 2062.842130] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jun 25 12:46:25 mnich kernel: [ 2062.842941] CR2: 0000000006e10000 CR3: 000000079e356000 CR4: 00000000001427f0
Jun 25 12:46:25 mnich kernel: [ 2062.843805] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Jun 25 12:46:25 mnich kernel: [ 2062.844630] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Jun 25 12:46:25 mnich kernel: [ 2062.845451] Process vhost-4150 (pid: 4151, threadinfo ffff88078594e000, task ffff8807805d8000)
Jun 25 12:46:25 mnich kernel: [ 2062.846280] Stack:
Jun 25 12:46:25 mnich kernel: [ 2062.847105]  ffffea0016a56480 0000000000000012 ffff88078594fb88 ffffffff815dcf8f
Jun 25 12:46:25 mnich kernel: [ 2062.847970]  ffff8807aa3ba640 ffff8807aa3ba640 ffff88078594fba8 ffffffff815dd35e
Jun 25 12:46:25 mnich kernel: [ 2062.848817]  ffffea0016a32680 ffff8807a0fe0b00 ffff88078594fbd8 ffffffff815dc9da
Jun 25 12:46:25 mnich kernel: [ 2062.849691] Call Trace:
Jun 25 12:46:25 mnich kernel: [ 2062.850533]  [<ffffffff815dcf8f>] skb_release_data+0x8f/0x110
Jun 25 12:46:25 mnich kernel: [ 2062.851405]  [<ffffffff815dd35e>] __kfree_skb+0x1e/0xa0
Jun 25 12:46:25 mnich kernel: [ 2062.852277]  [<ffffffff815dc9da>] kfree_skb+0x3a/0x170
Jun 25 12:46:25 mnich kernel: [ 2062.853131]  [<ffffffffa016db54>] tun_get_user+0x724/0x860 [tun]
Jun 25 12:46:25 mnich kernel: [ 2062.853993]  [<ffffffffa016dced>] tun_sendmsg+0x5d/0x90 [tun]
Jun 25 12:46:25 mnich kernel: [ 2062.854852]  [<ffffffffa020acb4>] handle_tx+0x294/0x6f0 [vhost_net]
Jun 25 12:46:25 mnich kernel: [ 2062.855711]  [<ffffffffa020b145>] handle_tx_kick+0x15/0x20 [vhost_net]
Jun 25 12:46:25 mnich kernel: [ 2062.856571]  [<ffffffffa0207ad8>] vhost_worker+0xf8/0x1a0 [vhost_net]
Jun 25 12:46:25 mnich kernel: [ 2062.857451]  [<ffffffffa02079e0>] ? translate_desc.isra.8+0x270/0x270 [vhost_net]
Jun 25 12:46:25 mnich kernel: [ 2062.858316]  [<ffffffff8109626d>] kthread+0xed/0x100
Jun 25 12:46:25 mnich kernel: [ 2062.859179]  [<ffffffff81096180>] ? kthread_create_on_node+0x160/0x160
Jun 25 12:46:25 mnich kernel: [ 2062.860066]  [<ffffffff8173f02c>] ret_from_fork+0x7c/0xb0
Jun 25 12:46:25 mnich kernel: [ 2062.860932]  [<ffffffff81096180>] ? kthread_create_on_node+0x160/0x160
Jun 25 12:46:25 mnich kernel: [ 2062.861836] Code: 45 fc 65 48 01 04 25 f0 06 1d 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 <48> f7 07 00 c0 00 00 75 34 8b 47 1c 85 c0 74 1a f0 ff 4b 1c 0f 
Jun 25 12:46:25 mnich kernel: [ 2062.862858] RIP  [<ffffffff81175d71>] put_page+0x11/0x60
Jun 25 12:46:25 mnich kernel: [ 2062.863823]  RSP <ffff88078594fb58>
Jun 25 12:46:25 mnich kernel: [ 2062.870684] ---[ end trace ca7c6b1a2e74659f ]---
Jun 25 12:47:17 mnich kernel: [ 2114.716969] qemu-kvm (6152) used greatest stack depth: 2192 bytes left
Jun 25 12:56:21 mnich kernel: [ 2659.820219] virbr0: port 7(vnet5) entered disabled state
Jun 25 12:56:21 mnich kernel: [ 2659.821036] device vnet5 left promiscuous mode
Jun 25 12:56:21 mnich kernel: [ 2659.821057] virbr0: port 7(vnet5) entered disabled state
Jun 25 12:56:22 mnich kernel: [ 2660.391553] virbr0: port 6(vnet4) entered disabled state
Jun 25 12:56:22 mnich kernel: [ 2660.392103] device vnet4 left promiscuous mode
Jun 25 12:56:22 mnich kernel: [ 2660.392122] virbr0: port 6(vnet4) entered disabled state
Jun 25 12:56:36 mnich kernel: [ 2674.443919] virbr0: port 8(vnet6) entered disabled state
Jun 25 12:56:36 mnich kernel: [ 2674.444784] device vnet6 left promiscuous mode
Jun 25 12:56:36 mnich kernel: [ 2674.444808] virbr0: port 8(vnet6) entered disabled state
Jun 25 12:56:41 mnich kernel: [ 2679.619934] virbr0: port 9(vnet7) entered disabled state
Jun 25 12:56:41 mnich kernel: [ 2679.620901] device vnet7 left promiscuous mode
Jun 25 12:56:41 mnich kernel: [ 2679.620929] virbr0: port 9(vnet7) entered disabled state
Jun 25 12:56:44 mnich kernel: [ 2682.289503] virbr0: port 11(vnet9) entered disabled state
Jun 25 12:56:44 mnich kernel: [ 2682.290488] device vnet9 left promiscuous mode
Jun 25 12:56:44 mnich kernel: [ 2682.290509] virbr0: port 11(vnet9) entered disabled state
Jun 25 12:56:47 mnich kernel: [ 2685.437255] virbr0: port 10(vnet8) entered disabled state
Jun 25 12:56:47 mnich kernel: [ 2685.438005] device vnet8 left promiscuous mode
Jun 25 12:56:47 mnich kernel: [ 2685.438031] virbr0: port 10(vnet8) entered disabled state

After the oops:
kernel.printk = 15	4	1	7


I don't know if it would help you or not, but I was able to get the qemu-kvm process to dump core with 'kill -4'..  Abrt wrote it to /var/tmp/abrt/, but it's over 840M even when compressed with XZ.

Also, here's the ps, before, during, and after the kill/dump:
# ps auxww |grep qemu
qemu      4150 16.1 25.2 11236168 8305748 ?    Sl   12:23   7:50 /usr/bin/qemu-kvm -name mnich-win7 -S -M pc-1.2 -enable-kvm -m 8192 -smp 2,sockets=2,cores=1,threads=1 -uuid aba8fb50-d883-d9ce-3706-51574b8dad78 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mnich-win7.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=cd,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-ccid,id=ccid0 -drive file=/raid1/vms/mnich-win7.img,if=none,id=drive-virtio-disk0,format=qed,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/raid1/iso/virtio-win-0.1-59.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b7:97:fb,bus=pci.0,addr=0x3 -device ccid-card-emulated,backend=nss-emulated,id=smartcard0,bus=ccid0.0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5910,addr=127.0.0.1,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
root      8759  0.0  0.0 109180   916 pts/0    S+   13:12   0:00 grep --color=auto qemu

# kill -4 4150

# ps auxww |grep 4150
qemu      4150 16.2 25.2 11236168 8305776 ?    Sl   12:23   7:59 /usr/bin/qemu-kvm -name mnich-win7 -S -M pc-1.2 -enable-kvm -m 8192 -smp 2,sockets=2,cores=1,threads=1 -uuid aba8fb50-d883-d9ce-3706-51574b8dad78 -no-user-config -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/mnich-win7.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=localtime -no-shutdown -boot order=cd,menu=on -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -device usb-ccid,id=ccid0 -drive file=/raid1/vms/mnich-win7.img,if=none,id=drive-virtio-disk0,format=qed,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x7,drive=drive-virtio-disk0,id=virtio-disk0 -drive file=/raid1/iso/virtio-win-0.1-59.iso,if=none,id=drive-ide0-0-0,readonly=on,format=raw -device ide-cd,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=25,id=hostnet0,vhost=on,vhostfd=36 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:b7:97:fb,bus=pci.0,addr=0x3 -device ccid-card-emulated,backend=nss-emulated,id=smartcard0,bus=ccid0.0 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -spice port=5910,addr=127.0.0.1,disable-ticketing -vga qxl -global qxl-vga.vram_size=67108864 -device intel-hda,id=sound0,bus=pci.0,addr=0x4 -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6
root      4157  0.0  0.0      0     0 ?        S    12:23   0:00 [kvm-pit/4150]
root      8784 94.9  0.0  41016  1352 ?        R    13:12   0:10 /usr/libexec/abrt-hook-ccpp 4 0 4150 107 107 1372180374 e
root      8799  0.0  0.0 109180   920 pts/0    S+   13:13   0:00 grep --color=auto 4150

# ps auxww |grep 4150
qemu      4150 15.9  0.0      0     0 ?        D    12:23   8:03 [qemu-kvm]
root      8850  0.0  0.0 109184   984 pts/0    S+   13:14   0:00 grep --color=auto 4150

Comment 28 Michael S. Tsirkin 2013-06-25 17:54:05 UTC

I think you need to dump log with dmesg.

Comment 29 Michael S. Tsirkin 2013-06-25 17:57:34 UTC

also can you pls try this:
for file in /sys/class/net/tun0/statistics/*; do echo $file; cat $file;done

replace tun0 with name of the tun device used for your guest.

Comment 30 James Boyle 2013-06-25 20:12:44 UTC

#28: The syslogs appear to have exactly the same information as dmesg.

#29: (Since I'm not sure which is the tun interface... and I'm assuming it wasn't named '*tun*' and somehow destroyed because of the issue)

# pwd
/sys/class/net
# ls
em1  lo  virbr0  virbr0-nic  vnet10
# for X in v*; do  find ./${X}/statistics -type f -print -exec cat '{}' \;; done

./virbr0/statistics/rx_fifo_errors
0
./virbr0/statistics/collisions
0
./virbr0/statistics/rx_errors
0
./virbr0/statistics/rx_compressed
0
./virbr0/statistics/rx_dropped
0
./virbr0/statistics/tx_packets
4379
./virbr0/statistics/tx_errors
0
./virbr0/statistics/rx_missed_errors
0
./virbr0/statistics/rx_over_errors
0
./virbr0/statistics/tx_carrier_errors
0
./virbr0/statistics/tx_heartbeat_errors
0
./virbr0/statistics/rx_crc_errors
0
./virbr0/statistics/multicast
0
./virbr0/statistics/tx_fifo_errors
0
./virbr0/statistics/tx_aborted_errors
0
./virbr0/statistics/rx_bytes
2835998
./virbr0/statistics/tx_compressed
0
./virbr0/statistics/tx_dropped
0
./virbr0/statistics/rx_packets
3697
./virbr0/statistics/tx_bytes
7798275
./virbr0/statistics/tx_window_errors
0
./virbr0/statistics/rx_frame_errors
0
./virbr0/statistics/rx_length_errors
0
./virbr0-nic/statistics/rx_fifo_errors
0
./virbr0-nic/statistics/collisions
0
./virbr0-nic/statistics/rx_errors
0
./virbr0-nic/statistics/rx_compressed
0
./virbr0-nic/statistics/rx_dropped
0
./virbr0-nic/statistics/tx_packets
0
./virbr0-nic/statistics/tx_errors
0
./virbr0-nic/statistics/rx_missed_errors
0
./virbr0-nic/statistics/rx_over_errors
0
./virbr0-nic/statistics/tx_carrier_errors
0
./virbr0-nic/statistics/tx_heartbeat_errors
0
./virbr0-nic/statistics/rx_crc_errors
0
./virbr0-nic/statistics/multicast
0
./virbr0-nic/statistics/tx_fifo_errors
0
./virbr0-nic/statistics/tx_aborted_errors
0
./virbr0-nic/statistics/rx_bytes
0
./virbr0-nic/statistics/tx_compressed
0
./virbr0-nic/statistics/tx_dropped
0
./virbr0-nic/statistics/rx_packets
0
./virbr0-nic/statistics/tx_bytes
0
./virbr0-nic/statistics/tx_window_errors
0
./virbr0-nic/statistics/rx_frame_errors
0
./virbr0-nic/statistics/rx_length_errors
0
./vnet10/statistics/rx_fifo_errors
0
./vnet10/statistics/collisions
0
./vnet10/statistics/rx_errors
0
./vnet10/statistics/rx_compressed
0
./vnet10/statistics/rx_dropped
1
./vnet10/statistics/tx_packets
1071
./vnet10/statistics/tx_errors
0
./vnet10/statistics/rx_missed_errors
0
./vnet10/statistics/rx_over_errors
0
./vnet10/statistics/tx_carrier_errors
0
./vnet10/statistics/tx_heartbeat_errors
0
./vnet10/statistics/rx_crc_errors
0
./vnet10/statistics/multicast
0
./vnet10/statistics/tx_fifo_errors
0
./vnet10/statistics/tx_aborted_errors
0
./vnet10/statistics/rx_bytes
2275036
./vnet10/statistics/tx_compressed
0
./vnet10/statistics/tx_dropped
5665
./vnet10/statistics/rx_packets
640
./vnet10/statistics/tx_bytes
112819
./vnet10/statistics/tx_window_errors
0
./vnet10/statistics/rx_frame_errors
0
./vnet10/statistics/rx_length_errors
0

# for X in v* ; do ifconfig ${X}; done
virbr0: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet 192.168.100.1  netmask 255.255.255.0  broadcast 192.168.100.255
        ether 52:54:00:ca:49:ef  txqueuelen 0  (Ethernet)
        RX packets 3697  bytes 2835998 (2.7 MiB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 4379  bytes 7798275 (7.4 MiB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

virbr0-nic: flags=4098<BROADCAST,MULTICAST>  mtu 1500
        ether 52:54:00:ca:49:ef  txqueuelen 500  (Ethernet)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

vnet10: flags=4163<UP,BROADCAST,RUNNING,MULTICAST>  mtu 1500
        inet6 fe80::fc54:ff:feb7:97fb  prefixlen 64  scopeid 0x20<link>
        ether fe:54:00:b7:97:fb  txqueuelen 500  (Ethernet)
        RX packets 640  bytes 2275036 (2.1 MiB)
        RX errors 0  dropped 1  overruns 0  frame 0
        TX packets 1071  bytes 112819 (110.1 KiB)
        TX errors 0  dropped 5735 overruns 0  carrier 0  collisions 0

Comment 31 Brad Hubbard 2013-06-25 22:01:05 UTC

(In reply to Michael S. Tsirkin from comment #29)
> also can you pls try this:
> for file in /sys/class/net/tun0/statistics/*; do echo $file; cat $file;done
> 
> replace tun0 with name of the tun device used for your guest.

tun0 was on my system (vpn connection) and should not be relevant to the guest as the guest doesn't use it for transferring data to/from the host.

My output came straight from dmesg so I don't think the message is being masked.

Comment 32 jason wang 2013-06-26 05:18:41 UTC

(In reply to James Boyle from comment #30)
> #28: The syslogs appear to have exactly the same information as dmesg.
> 
> #29: (Since I'm not sure which is the tun interface... and I'm assuming it
> wasn't named '*tun*' and somehow destroyed because of the issue)
> 
> # pwd
> /sys/class/net
> # ls
> em1  lo  virbr0  virbr0-nic  vnet10
> # for X in v*; do  find ./${X}/statistics -type f -print -exec cat '{}' \;;
> done
> 
> ./virbr0/statistics/rx_fifo_errors
> 0
> ./virbr0/statistics/collisions
> 0
> ./virbr0/statistics/rx_errors
> 0
> ./virbr0/statistics/rx_compressed
> 0
> ./virbr0/statistics/rx_dropped
> 0
> ./virbr0/statistics/tx_packets
> 4379
> ./virbr0/statistics/tx_errors
> 0
> ./virbr0/statistics/rx_missed_errors
> 0
> ./virbr0/statistics/rx_over_errors
> 0
> ./virbr0/statistics/tx_carrier_errors
> 0
> ./virbr0/statistics/tx_heartbeat_errors
> 0
> ./virbr0/statistics/rx_crc_errors
> 0
> ./virbr0/statistics/multicast
> 0
> ./virbr0/statistics/tx_fifo_errors
> 0
> ./virbr0/statistics/tx_aborted_errors
> 0
> ./virbr0/statistics/rx_bytes
> 2835998
> ./virbr0/statistics/tx_compressed
> 0
> ./virbr0/statistics/tx_dropped
> 0
> ./virbr0/statistics/rx_packets
> 3697
> ./virbr0/statistics/tx_bytes
> 7798275
> ./virbr0/statistics/tx_window_errors
> 0
> ./virbr0/statistics/rx_frame_errors
> 0
> ./virbr0/statistics/rx_length_errors
> 0
> ./virbr0-nic/statistics/rx_fifo_errors
> 0
> ./virbr0-nic/statistics/collisions
> 0
> ./virbr0-nic/statistics/rx_errors
> 0
> ./virbr0-nic/statistics/rx_compressed
> 0
> ./virbr0-nic/statistics/rx_dropped
> 0
> ./virbr0-nic/statistics/tx_packets
> 0
> ./virbr0-nic/statistics/tx_errors
> 0
> ./virbr0-nic/statistics/rx_missed_errors
> 0
> ./virbr0-nic/statistics/rx_over_errors
> 0
> ./virbr0-nic/statistics/tx_carrier_errors
> 0
> ./virbr0-nic/statistics/tx_heartbeat_errors
> 0
> ./virbr0-nic/statistics/rx_crc_errors
> 0
> ./virbr0-nic/statistics/multicast
> 0
> ./virbr0-nic/statistics/tx_fifo_errors
> 0
> ./virbr0-nic/statistics/tx_aborted_errors
> 0
> ./virbr0-nic/statistics/rx_bytes
> 0
> ./virbr0-nic/statistics/tx_compressed
> 0
> ./virbr0-nic/statistics/tx_dropped
> 0
> ./virbr0-nic/statistics/rx_packets
> 0
> ./virbr0-nic/statistics/tx_bytes
> 0
> ./virbr0-nic/statistics/tx_window_errors
> 0
> ./virbr0-nic/statistics/rx_frame_errors
> 0
> ./virbr0-nic/statistics/rx_length_errors
> 0
> ./vnet10/statistics/rx_fifo_errors
> 0
> ./vnet10/statistics/collisions
> 0
> ./vnet10/statistics/rx_errors
> 0
> ./vnet10/statistics/rx_compressed
> 0
> ./vnet10/statistics/rx_dropped
> 1

So there's one packet dropped in rx. Looks like Michael's path may help here.

Please try this build which contains all patches mentioned above:

http://koji.fedoraproject.org/koji/taskinfo?taskID=5541985

Thanks

Comment 33 James Boyle 2013-06-26 14:44:30 UTC

(In reply to jason wang from comment #32)
> So there's one packet dropped in rx. Looks like Michael's path may help here.
> 
> Please try this build which contains all patches mentioned above:
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5541985
> 
> Thanks

I downloaded the rpms from the link above, installed, and ran a quick test.  
# uname -a
Linux mnich.quotient-inc.com 3.9.6-301.6.fc20.x86_64 #1 SMP Wed Jun 26 04:21:06 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

from dmesg:
[  162.550324] hrtimer: interrupt took 3294 ns
[  866.145123] virbr0: port 5(vnet3) entered disabled state
[  866.146414] device vnet3 left promiscuous mode
[  866.146421] virbr0: port 5(vnet3) entered disabled state
[  901.154890] virbr0: port 4(vnet2) entered disabled state
[  901.155641] device vnet2 left promiscuous mode
[  901.155648] virbr0: port 4(vnet2) entered disabled state
[  939.288118] BUG: unable to handle kernel NULL pointer dereference at           (null)
[  939.288147] IP: [<ffffffff8113caaa>] put_page+0xa/0x60
[  939.288165] PGD 0 
[  939.288173] Oops: 0000 [#1] SMP 
[  939.288185] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat be2iscsi iscsi_boot_sysfs bnx2i iptable_mangle cnic uio cxgb4i cxgb4 nf_conntrack_ipv4 cxgb3i nf_defrag_ipv4 cxgb3 xt_conntrack mdio libcxgbi nf_conntrack bnep ebtable_filter ib_iser ebtables rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core bluetooth ip6table_filter rfkill ip6_tables iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi acpi_cpufreq snd_hda_codec_hdmi mperf coretemp snd_hda_codec_realtek joydev snd_hda_intel snd_hda_codec snd_hwdep e1000e ptp iTCO_wdt iTCO_vendor_support vhost_net microcode snd_seq snd_seq_device pps_core snd_pcm snd_page_alloc snd_timer snd
[  939.288443]  mei dcdbas lpc_ich mfd_core raid1 tpm_tis usblp i2c_i801 soundcore tun macvtap macvlan tpm tpm_bios serio_raw kvm_intel kvm binfmt_misc uinput dm_crypt radeon crc32_pclmul i2c_algo_bit drm_kms_helper crc32c_intel ttm ghash_clmulni_intel drm i2c_core video
[  939.288533] CPU 2 
[  939.288541] Pid: 3192, comm: vhost-3191 Not tainted 3.9.6-301.6.fc20.x86_64 #1 Dell Inc. OptiPlex 9010/0M9KCM
[  939.288567] RIP: 0010:[<ffffffff8113caaa>]  [<ffffffff8113caaa>] put_page+0xa/0x60
[  939.288588] RSP: 0018:ffff8807d18d3c78  EFLAGS: 00010202
[  939.288602] RAX: ffff8807d19776c0 RBX: 0000000000000012 RCX: 0000000000000003
[  939.288620] RDX: 0000000000000140 RSI: 0000000000000246 RDI: 0000000000000000
[  939.288638] RBP: ffff8807d18d3c80 R08: 0000000000000003 R09: 0000000000000010
[  939.288656] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8807ce74aa00
[  939.288674] R13: ffffffffa01f8fd4 R14: 000000000000000c R15: ffff8807ce74aa00
[  939.288693] FS:  0000000000000000(0000) GS:ffff88081dd00000(0000) knlGS:0000000000000000
[  939.288713] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  939.288728] CR2: 0000000000000000 CR3: 00000007d18c8000 CR4: 00000000001427e0
[  939.288746] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  939.288764] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  939.288783] Process vhost-3191 (pid: 3192, threadinfo ffff8807d18d2000, task ffff8807c8a7c650)
[  939.288804] Stack:
[  939.288810]  0000000000000012 ffff8807d18d3ca0 ffffffff8152cb87 ffff8807ce74aa00
[  939.288833]  ffff8807ce74aa00 ffff8807d18d3cb8 ffffffff8152cc1a ffff8807c286c800
[  939.288856]  ffff8807d18d3ce0 ffffffff8152cce2 ffff8807c286c800 000000000000f572
[  939.288878] Call Trace:
[  939.288887]  [<ffffffff8152cb87>] skb_release_data+0x87/0x100
[  939.288903]  [<ffffffff8152cc1a>] __kfree_skb+0x1a/0xb0
[  939.288917]  [<ffffffff8152cce2>] kfree_skb+0x32/0x90
[  939.288934]  [<ffffffffa01f8fd4>] tun_get_user+0x724/0x810 [tun]
[  939.288952]  [<ffffffffa01f9117>] tun_sendmsg+0x57/0x80 [tun]
[  939.288970]  [<ffffffffa02e5a78>] handle_tx+0x1c8/0x640 [vhost_net]
[  939.288988]  [<ffffffffa02e5f25>] handle_tx_kick+0x15/0x20 [vhost_net]
[  939.289007]  [<ffffffffa02e281d>] vhost_worker+0xed/0x190 [vhost_net]
[  939.289025]  [<ffffffffa02e2730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
[  939.289045]  [<ffffffff81080200>] kthread+0xc0/0xd0
[  939.289059]  [<ffffffff81080140>] ? insert_kthread_work+0x40/0x40
[  939.289076]  [<ffffffff8164ed2c>] ret_from_fork+0x7c/0xb0
[  939.289091]  [<ffffffff81080140>] ? insert_kthread_work+0x40/0x40
[  939.289106] Code: 00 00 01 75 f3 e9 b6 fe ff ff be 47 01 00 00 48 c7 c7 6f 51 9d 81 e8 46 02 f2 ff e9 a0 fe ff ff 90 0f 1f 44 00 00 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
[  939.289229] RIP  [<ffffffff8113caaa>] put_page+0xa/0x60
[  939.289244]  RSP <ffff8807d18d3c78>
[  939.289254] CR2: 0000000000000000
[  939.295848] ---[ end trace bab0f053102bcd45 ]---

I'll rerun with the debug kernel later today or tomorrow when I have the time.

--James

Comment 34 Michael S. Tsirkin 2013-06-26 18:34:14 UTC

Since it's reproducible so easily for you, will you be able to try
a special kernel with all kind of instrumentation, to figure out
what's going on here?
If yes I'll put here a patch with this extra debug info.

Comment 35 Brad Hubbard 2013-06-26 21:14:32 UTC

(In reply to Michael S. Tsirkin from comment #34)
> Since it's reproducible so easily for you, will you be able to try
> a special kernel with all kind of instrumentation, to figure out
> what's going on here?
> If yes I'll put here a patch with this extra debug info.

Please do Michael, I'm sure either James or I will be able to test it.

Comment 36 James Boyle 2013-06-27 13:06:28 UTC

(In reply to Michael S. Tsirkin from comment #34)
> Since it's reproducible so easily for you, will you be able to try
> a special kernel with all kind of instrumentation, to figure out
> what's going on here?
> If yes I'll put here a patch with this extra debug info.

Sure, sign me up!

Comment 37 Josh Boyer 2013-07-01 11:58:35 UTC

*** Bug 979644 has been marked as a duplicate of this bug. ***

Comment 38 James Boyle 2013-07-01 14:13:06 UTC

# sysctl -a |grep printk
kernel.printk = 4	4	1	7
kernel.printk_delay = 0
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10
# sysctl -w kernel.printk=7
kernel.printk = 7
# sysctl -a |grep printk
kernel.printk = 7	4	1	7
kernel.printk_delay = 0
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10
# uname -a
Linux mnich.quotient-inc.com 3.9.6-301.6.fc20.x86_64.debug #1 SMP Wed Jun 26 03:56:13 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux


[232223.854747] sysctl: The scan_unevictable_pages sysctl/node-interface has been disabled for lack of a legitimate use case.  If you have one, please send an email to linux-mm.
[232361.566934] general protection fault: 0000 [#1] SMP 
[232361.566961] Modules linked in: tcp_lp fuse ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE be2iscsi iscsi_boot_sysfs bnx2i ip6table_mangle cnic uio cxgb4i cxgb4 cxgb3i ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 cxgb3 mdio libcxgbi iptable_nat nf_nat_ipv4 nf_nat ib_iser iptable_mangle bnep rdma_cm ib_addr nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack iw_cm ib_cm ib_sa ib_mad ib_core ebtable_filter bluetooth rfkill ebtables iscsi_tcp ip6table_filter ip6_tables libiscsi_tcp libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel acpi_cpufreq snd_hda_codec snd_hwdep mperf snd_seq snd_seq_device coretemp raid1 snd_pcm iTCO_wdt iTCO_vendor_support snd_page_alloc snd_timer usblp snd e1000e ptp dcdbas soundcore
[232361.567243]  i2c_i801 microcode joydev lpc_ich serio_raw pps_core tpm_tis mei vhost_net mfd_core tpm tun macvtap macvlan tpm_bios kvm_intel kvm uinput binfmt_misc dm_crypt crc32_pclmul crc32c_intel ghash_clmulni_intel radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core video
[232361.567343] CPU 0 
[232361.567352] Pid: 4315, comm: vhost-4314 Not tainted 3.9.6-301.6.fc20.x86_64.debug #1 Dell Inc. OptiPlex 9010/0M9KCM
[232361.567382] RIP: 0010:[<ffffffff81171bea>]  [<ffffffff81171bea>] put_page+0xa/0x60
[232361.567409] RSP: 0000:ffff88042fef5bd0  EFLAGS: 00010202
[232361.567424] RAX: ffff8807c84457b8 RBX: 0000000000000012 RCX: 0000000000000006
[232361.567445] RDX: 0000000000000140 RSI: ffff8803c9650700 RDI: cccccccccccccccc
[232361.567465] RBP: ffff88042fef5bd8 R08: 0000000000000001 R09: 0000000000000000
[232361.567485] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8807e729db00
[232361.567504] R13: ffffffffa0207b24 R14: 000000000000000c R15: ffff8807e729db00
[232361.567525] FS:  0000000000000000(0000) GS:ffff8807fce00000(0000) knlGS:0000000000000000
[232361.567547] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[232361.567563] CR2: 00000000072f0000 CR3: 00000003b8ba8000 CR4: 00000000001427e0
[232361.567583] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[232361.567603] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[232361.567624] Process vhost-4314 (pid: 4315, threadinfo ffff88042fef4000, task ffff8803c9650000)
[232361.567676] Stack:
[232361.567690]  0000000000000012 ffff88042fef5bf8 ffffffff815c30f7 ffff8807e729db00
[232361.567743]  ffff8807e729db00 ffff88042fef5c10 ffffffff815c318a ffff8803b8e48b00
[232361.567795]  ffff88042fef5c38 ffffffff815c3256 ffff8803b8e48b00 000000000000f572
[232361.567847] Call Trace:
[232361.567868]  [<ffffffff815c30f7>] skb_release_data+0x87/0x100
[232361.567905]  [<ffffffff815c318a>] __kfree_skb+0x1a/0xb0
[232361.567938]  [<ffffffff815c3256>] kfree_skb+0x36/0x160
[232361.567978]  [<ffffffffa0207b24>] tun_get_user+0x7c4/0x900 [tun]
[232361.568020]  [<ffffffffa0207cb4>] tun_sendmsg+0x54/0x80 [tun]
[232361.568062]  [<ffffffffa0218af7>] handle_tx+0x1f7/0x6d0 [vhost_net]
[232361.568105]  [<ffffffffa0219005>] handle_tx_kick+0x15/0x20 [vhost_net]
[232361.568149]  [<ffffffffa0214560>] vhost_worker+0xf0/0x190 [vhost_net]
[232361.568193]  [<ffffffffa0214470>] ? vhost_zerocopy_done_signal+0x20/0x20 [vhost_net]
[232361.568234]  [<ffffffff8109482d>] kthread+0xed/0x100
[232361.568250]  [<ffffffff81094740>] ? insert_kthread_work+0x80/0x80
[232361.568289]  [<ffffffff8172456c>] ret_from_fork+0x7c/0xb0
[232361.568324]  [<ffffffff81094740>] ? insert_kthread_work+0x80/0x80
[232361.568360] Code: e0 ff ff 01 e9 a8 fe ff ff be 47 01 00 00 48 c7 c7 ac f1 9e 81 e8 f7 73 ef ff e9 76 fe ff ff 66 90 0f 1f 44 00 00 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
[232361.568622] RIP  [<ffffffff81171bea>] put_page+0xa/0x60
[232361.568657]  RSP <ffff88042fef5bd0>
[232361.582117] ---[ end trace ce5d8fc725094591 ]---

(after the crash)
# sysctl -a |grep printk
kernel.printk = 15	4	1	7
kernel.printk_delay = 0
kernel.printk_ratelimit = 5
kernel.printk_ratelimit_burst = 10

I'm still happy to try the "specially instrumented" kernel :^)

Comment 39 jason wang 2013-07-02 09:44:57 UTC

I am able to hit this issue once when I copy a large file from a win7 guest to remote smb server. But it was not reproduce-able for me since I only see it after about 50 times try. Will continue look at this.

Comment 40 James Boyle 2013-07-02 16:48:04 UTC

Just to ask the question again: would it be helpful for anyone troubleshooting to have a core dump of the qemu-kvm process?

Comment 41 Stephen John Smoogen 2013-07-02 17:01:45 UTC

James. What helped the kernel programmers for at least my part of the pie was to use the rawhide kernel on the host with the extra debugging. This caused it from going from a crash to an oops which pointed where one of the problems might be.

Comment 42 Josh Boyer 2013-07-03 00:18:44 UTC

*** Bug 978785 has been marked as a duplicate of this bug. ***

Comment 43 Stephen Degler 2013-07-03 16:01:02 UTC

FYI

I have this bug and can reproduce on demand. (Win7 64 smb file copy blows up instantly).
If anyone wants me to test I can do so.

Stephen Degler

Comment 44 Michael S. Tsirkin 2013-07-04 09:57:41 UTC

Created attachment 768692 [details]
debugging patch 1

please try the attached patch - it should still crash but report error cause
in dmesg before crash.

Comment 45 Michael S. Tsirkin 2013-07-04 10:01:13 UTC

Created attachment 768693 [details]
debugging patch 2 (obsoletes 1)

this patch can't fix crash but should print a diagnostic
error in dmesg before the crash.

Comment 46 Brad Hubbard 2013-07-07 05:23:39 UTC

Progress. I found some time this weekend and did a mock build locally with the patch from comment 45 and got the following.

$ cat /sys/class/net/vnet1/statistics/rx_dropped
1

[  561.307591] unable to build skb from iovec. zerocopy 1 offset 12 count 20 len 49506
[  561.307619] BUG: unable to handle kernel paging request at 00000099020e001c
[  561.307662] IP: [<ffffffff8113ccaa>] put_page+0xa/0x60
[  561.307692] PGD 0 
[  561.307704] Oops: 0000 [#1] SMP 
[  561.307724] Modules linked in: vhost_net macvtap macvlan hidp ebtable_nat ebtables fuse ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_CHECKSUM iptable_mangle tun bridge stp llc rfcomm bnep ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack ip6table_filter nf_conntrack ip6_tables xfs libcrc32c arc4 iwldvm iTCO_wdt iTCO_vendor_support uvcvideo videobuf2_vmalloc mac80211 videobuf2_memops videobuf2_core snd_hda_codec_hdmi snd_hda_codec_conexant mperf videodev media coretemp kvm_intel snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_hda_intel snd_hda_codec snd_hwdep iwlwifi snd_seq snd_seq_device sdhci_pci sdhci mmc_core snd_pcm cfg80211 btusb kvm bluetooth lpc_ich mfd_core snd_page_alloc snd_timer mei i2c_i801 microcode e1000e thinkpad_acpi ptp pps_core snd soundcore
[  561.308075]  rfkill wmi uinput dm_crypt crc32_pclmul crc32c_intel ghash_clmulni_intel i915 firewire_ohci firewire_core i2c_algo_bit drm_kms_helper crc_itu_t usb_storage drm i2c_core video
[  561.308145] CPU 1 
[  561.308156] Pid: 4305, comm: vhost-4304 Not tainted 3.9.9-302.bz975065.fc19.x86_64 #1 LENOVO 4243BQ3/4243BQ3
[  561.308187] RIP: 0010:[<ffffffff8113ccaa>]  [<ffffffff8113ccaa>] put_page+0xa/0x60
[  561.308215] RSP: 0018:ffff8800b0701c78  EFLAGS: 00010202
[  561.308233] RAX: ffff8801a88e42c0 RBX: 0000000000000012 RCX: 0000000000000003
[  561.308256] RDX: 0000000000000140 RSI: 0000000000000246 RDI: 00000099020e001c
[  561.308278] RBP: ffff8800b0701c80 R08: 0000000020656c62 R09: 000000000000048d
[  561.308300] R10: 0000000000000000 R11: 0000000000000000 R12: ffff880100e2e400
[  561.308324] R13: ffffffffa073b054 R14: 000000000000000c R15: ffff880100e2e400
[  561.308347] FS:  0000000000000000(0000) GS:ffff88021e240000(0000) knlGS:0000000000000000
[  561.308373] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  561.308392] CR2: 00000099020e001c CR3: 00000000b5b98000 CR4: 00000000000427e0
[  561.308414] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  561.308438] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[  561.308470] Process vhost-4304 (pid: 4305, threadinfo ffff8800b0700000, task ffff8802124c1770)
[  561.308497] Stack:
[  561.308505]  0000000000000012 ffff8800b0701ca0 ffffffff8152d0a7 ffff880100e2e400
[  561.308534]  ffff880100e2e400 ffff8800b0701cb8 ffffffff8152d13a ffff8800b07bc800
[  561.308563]  ffff8800b0701ce0 ffffffff8152d202 ffff8800b07bc800 000000000000c162
[  561.308591] Call Trace:
[  561.308603]  [<ffffffff8152d0a7>] skb_release_data+0x87/0x100
[  561.308622]  [<ffffffff8152d13a>] __kfree_skb+0x1a/0xb0
[  561.308640]  [<ffffffff8152d202>] kfree_skb+0x32/0x90
[  561.308659]  [<ffffffffa073b054>] tun_get_user+0x7a4/0x8b0 [tun]
[  561.308680]  [<ffffffffa073b1b7>] tun_sendmsg+0x57/0x80 [tun]
[  561.308701]  [<ffffffffa07b1a78>] handle_tx+0x1c8/0x640 [vhost_net]
[  561.308722]  [<ffffffffa07b1f25>] handle_tx_kick+0x15/0x20 [vhost_net]
[  561.308744]  [<ffffffffa07ae81d>] vhost_worker+0xed/0x190 [vhost_net]
[  561.308767]  [<ffffffffa07ae730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
[  561.308794]  [<ffffffff810802a0>] kthread+0xc0/0xd0
[  561.308811]  [<ffffffff810801e0>] ? insert_kthread_work+0x40/0x40
[  561.308832]  [<ffffffff8164f26c>] ret_from_fork+0x7c/0xb0
[  561.308850]  [<ffffffff810801e0>] ? insert_kthread_work+0x40/0x40
[  561.308870] Code: 00 00 01 75 f3 e9 b6 fe ff ff be 47 01 00 00 48 c7 c7 4e 53 9d 81 e8 76 00 f2 ff e9 a0 fe ff ff 90 66 66 66 66 90 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
[  561.309015] RIP  [<ffffffff8113ccaa>] put_page+0xa/0x60
[  561.309035]  RSP <ffff8800b0701c78>
[  561.309047] CR2: 00000099020e001c
[  561.318175] ---[ end trace 8acb97eb9b242669 ]---

Comment 47 Richard Chan 2013-07-08 18:23:41 UTC

Just hit this with 3.9.9-302.fc19. Using WIN7 VM to Samba share on the host.

Comment 48 Ian Pilcher 2013-07-09 03:47:23 UTC

*** Bug 982437 has been marked as a duplicate of this bug. ***

Comment 49 jason wang 2013-07-09 09:35:36 UTC

Looks like something wrong during skb linearize. The head part were miscount into data part which lead nr_frags were unsynchronized with i in zerocopy_sg_from_iovec().

Will provide a patch for test.

Comment 50 jason wang 2013-07-09 09:37:42 UTC

Created attachment 770879 [details]
Patch to fix the linear part len.

Comment 51 James Boyle 2013-07-09 13:56:10 UTC

(In reply to Michael S. Tsirkin from comment #45)
> Created attachment 768693 [details]
> debugging patch 2 (obsoletes 1)
> 
> this patch can't fix crash but should print a diagnostic
> error in dmesg before the crash.

Output from kernel built with your patch.  (took a while because it's been a while since I've had to build from source + have a day job ;^)   )




[ 2681.177066] unable to build skb from iovec. zerocopy 1 offset 12 count 21 len 62834
[ 2681.177071] ------------[ cut here ]------------
[ 2681.177076] WARNING: at include/linux/mm.h:280 put_page+0x4c/0x60()
[ 2681.177077] Hardware name: OptiPlex 9010
[ 2681.177078] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE be2iscsi iscsi_boot_sysfs bnx2i ip6table_mangle cnic uio cxgb4i cxgb4 ip6t_REJECT cxgb3i nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat cxgb3 mdio iptable_mangle nf_conntrack_ipv4 libcxgbi nf_defrag_ipv4 ib_iser xt_conntrack nf_conntrack bnep rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core ebtable_filter iscsi_tcp ebtables ip6table_filter bluetooth ip6_tables libiscsi_tcp rfkill libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq acpi_cpufreq raid1 mperf snd_seq_device snd_pcm vhost_net coretemp tun macvtap macvlan snd_page_alloc kvm_intel kvm snd_timer iTCO_wdt joydev dcdbas tpm_tis
[ 2681.177108]  microcode e1000e iTCO_vendor_support snd cdc_acm i2c_i801 mei usblp ptp pps_core lpc_ich mfd_core uinput soundcore serio_raw tpm tpm_bios binfmt_misc dm_crypt usb_storage radeon crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper ghash_clmulni_intel ttm drm i2c_core video
[ 2681.177122] Pid: 4105, comm: vhost-4104 Not tainted 3.9.6-200.jwb.975065.fc18.x86_64 #1
[ 2681.177123] Call Trace:
[ 2681.177127]  [<ffffffff8105ef85>] warn_slowpath_common+0x75/0xa0
[ 2681.177132]  [<ffffffffa0255bcf>] ? tun_get_user+0x73f/0x840 [tun]
[ 2681.177134]  [<ffffffff8105efca>] warn_slowpath_null+0x1a/0x20
[ 2681.177135]  [<ffffffff811418cc>] put_page+0x4c/0x60
[ 2681.177138]  [<ffffffff8154612f>] skb_release_data+0x8f/0x110
[ 2681.177140]  [<ffffffff815461ce>] __kfree_skb+0x1e/0xa0
[ 2681.177142]  [<ffffffff81546286>] kfree_skb+0x36/0xa0
[ 2681.177144]  [<ffffffffa0255bcf>] tun_get_user+0x73f/0x840 [tun]
[ 2681.177147]  [<ffffffffa0255d27>] tun_sendmsg+0x57/0x80 [tun]
[ 2681.177150]  [<ffffffffa0319d87>] handle_tx+0x287/0x680 [vhost_net]
[ 2681.177153]  [<ffffffffa031a1b5>] handle_tx_kick+0x15/0x20 [vhost_net]
[ 2681.177156]  [<ffffffffa031695d>] vhost_worker+0xed/0x190 [vhost_net]
[ 2681.177159]  [<ffffffffa0316870>] ? vhost_work_flush+0x110/0x110 [vhost_net]
[ 2681.177162]  [<ffffffff81082af0>] kthread+0xc0/0xd0
[ 2681.177165]  [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_flush+0x20/0xb0
[ 2681.177167]  [<ffffffff81082a30>] ? kthread_create_on_node+0x120/0x120
[ 2681.177169]  [<ffffffff8166a52c>] ret_from_fork+0x7c/0xb0
[ 2681.177171]  [<ffffffff81082a30>] ? kthread_create_on_node+0x120/0x120
[ 2681.177172] ---[ end trace fb59906a52dcec67 ]---
[ 2681.177179] BUG: unable to handle kernel NULL pointer dereference at           (null)
[ 2681.177209] IP: [<ffffffff81141891>] put_page+0x11/0x60
[ 2681.177227] PGD 0 
[ 2681.177235] Oops: 0000 [#1] SMP 
[ 2681.177248] Modules linked in: fuse ebtable_nat xt_CHECKSUM bridge stp llc nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE be2iscsi iscsi_boot_sysfs bnx2i ip6table_mangle cnic uio cxgb4i cxgb4 ip6t_REJECT cxgb3i nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat cxgb3 mdio iptable_mangle nf_conntrack_ipv4 libcxgbi nf_defrag_ipv4 ib_iser xt_conntrack nf_conntrack bnep rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core ebtable_filter iscsi_tcp ebtables ip6table_filter bluetooth ip6_tables libiscsi_tcp rfkill libiscsi scsi_transport_iscsi snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec snd_hwdep snd_seq acpi_cpufreq raid1 mperf snd_seq_device snd_pcm vhost_net coretemp tun macvtap macvlan snd_page_alloc kvm_intel kvm snd_timer iTCO_wdt joydev dcdbas tpm_tis
[ 2681.177808]  microcode e1000e iTCO_vendor_support snd cdc_acm i2c_i801 mei usblp ptp pps_core lpc_ich mfd_core uinput soundcore serio_raw tpm tpm_bios binfmt_misc dm_crypt usb_storage radeon crc32_pclmul crc32c_intel i2c_algo_bit drm_kms_helper ghash_clmulni_intel ttm drm i2c_core video
[ 2681.178020] CPU 3 
[ 2681.178037] Pid: 4105, comm: vhost-4104 Tainted: G        W    3.9.6-200.jwb.975065.fc18.x86_64 #1 Dell Inc. OptiPlex 9010/0M9KCM
[ 2681.178110] RIP: 0010:[<ffffffff81141891>]  [<ffffffff81141891>] put_page+0x11/0x60
[ 2681.178162] RSP: 0018:ffff8807ce3cdbf8  EFLAGS: 00010292
[ 2681.178196] RAX: ffff8807b91e20c0 RBX: 0000000000000000 RCX: 0000000000000006
[ 2681.178241] RDX: 0000000000000150 RSI: 0000000000000000 RDI: 0000000000000000
[ 2681.178286] RBP: ffff8807ce3cdc08 R08: 000000000000000a R09: 0000000000000481
[ 2681.178330] R10: 0000000000000000 R11: 0000000000000480 R12: ffff8807d6626d00
[ 2681.178374] R13: ffffffffa0255bcf R14: ffff8807d6626d00 R15: ffff8807d59f8800
[ 2681.178419] FS:  0000000000000000(0000) GS:ffff88081dd80000(0000) knlGS:0000000000000000
[ 2681.178470] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2681.178506] CR2: 0000000000000000 CR3: 00000007c169e000 CR4: 00000000001427e0
[ 2681.178550] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 2681.178595] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 2681.178640] Process vhost-4104 (pid: 4105, threadinfo ffff8807ce3cc000, task ffff8807916addc0)
[ 2681.178693] Stack:
[ 2681.178708]  000000000000f572 0000000000000013 ffff8807ce3cdc28 ffffffff8154612f
[ 2681.178762]  ffff8807d6626d00 ffff8807d6626d00 ffff8807ce3cdc48 ffffffff815461ce
[ 2681.178816]  ffff8807ce3cdc88 ffff8807ba8b43d8 ffff8807ce3cdc78 ffffffff81546286
[ 2681.178870] Call Trace:
[ 2681.178889]  [<ffffffff8154612f>] skb_release_data+0x8f/0x110
[ 2681.178927]  [<ffffffff815461ce>] __kfree_skb+0x1e/0xa0
[ 2681.178962]  [<ffffffff81546286>] kfree_skb+0x36/0xa0
[ 2681.178999]  [<ffffffffa0255bcf>] tun_get_user+0x73f/0x840 [tun]
[ 2681.179042]  [<ffffffffa0255d27>] tun_sendmsg+0x57/0x80 [tun]
[ 2681.179082]  [<ffffffffa0319d87>] handle_tx+0x287/0x680 [vhost_net]
[ 2681.179126]  [<ffffffffa031a1b5>] handle_tx_kick+0x15/0x20 [vhost_net]
[ 2681.179172]  [<ffffffffa031695d>] vhost_worker+0xed/0x190 [vhost_net]
[ 2681.179216]  [<ffffffffa0316870>] ? vhost_work_flush+0x110/0x110 [vhost_net]
[ 2681.179263]  [<ffffffff81082af0>] kthread+0xc0/0xd0
[ 2681.179296]  [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_flush+0x20/0xb0
[ 2681.179344]  [<ffffffff81082a30>] ? kthread_create_on_node+0x120/0x120
[ 2681.179387]  [<ffffffff8166a52c>] ret_from_fork+0x7c/0xb0
[ 2681.179423]  [<ffffffff81082a30>] ? kthread_create_on_node+0x120/0x120
[ 2681.179464] Code: 45 fc 65 48 01 04 25 70 02 01 00 c9 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 53 48 89 fb 48 83 ec 08 <48> f7 07 00 c0 00 00 75 34 8b 47 1c 85 c0 74 1a f0 ff 4b 1c 0f 
[ 2681.179708] RIP  [<ffffffff81141891>] put_page+0x11/0x60
[ 2681.179745]  RSP <ffff8807ce3cdbf8>
[ 2681.179768] CR2: 0000000000000000
[ 2681.195958] ---[ end trace fb59906a52dcec68 ]---

Comment 52 Brad Hubbard 2013-07-09 21:40:05 UTC

(In reply to jason wang from comment #50)
> Created attachment 770879 [details]
> Patch to fix the linear part len.

I'll try to build a kernel to test this tonight.

Comment 53 Michael S. Tsirkin 2013-07-10 06:15:49 UTC

okay, these two should fix this crash:
http://patchwork.ozlabs.org/patch/257972/
http://patchwork.ozlabs.org/patch/257973/

Comment 54 James Boyle 2013-07-10 19:09:57 UTC

(In reply to Michael S. Tsirkin from comment #53)
> okay, these two should fix this crash:
> http://patchwork.ozlabs.org/patch/257972/
> http://patchwork.ozlabs.org/patch/257973/

I've built with those two patches; it didn't OOPS or crash, but the virtio interface wedged on the Win7 guest (no functionality after the skb error), and I still got this message:

Jul 10 13:59:58 mnich kernel: [ 1045.529209] unable to build skb from iovec. zerocopy 1 offset 12 count 20 len 62834

Comment 55 Michael S. Tsirkin 2013-07-11 09:08:53 UTC

Thanks for the report. Yes I see how this can happen.
With crashes finally behind us, looking at the proper fix
for the error.

Comment 56 Satish Balay 2013-07-16 18:47:42 UTC

Perhaps I'm seeing this issue aswell.

I see this trace on the console of a f19 machine [upgraded from F18] running a few VMs with network set to bridge mode. 

The machine and VMs are still useable - so the machine didn't crash.[my /var/log/messages is empty - but I'm able to grab stuff from dmesg.]

[root@maverick ~]# uname -a
Linux maverick.mcs.anl.gov 3.9.9-302.fc19.x86_64 #1 SMP Sat Jul 6 13:41:07 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[root@maverick ~]# uptime
 13:29:43 up 7 days, 12 min,  1 user,  load average: 1.97, 2.15, 2.13
[root@maverick ~]# 



[502486.657719] BUG: unable to handle kernel NULL pointer dereference at 000000000000001c
[502486.657762] IP: [<ffffffff8113cada>] put_compound_page+0xaa/0x270
[502486.657792] PGD 0 
[502486.657802] Oops: 0000 [#1] SMP 
[502486.657818] Modules linked in: vhost_net macvtap macvlan ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_CHECKSUM iptable_mangle tun bnep bluetooth bridge stp llc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack ip6table_filter ip6_tables nf_conntrack snd_hda_codec_analog iTCO_wdt hp_wmi mei sparse_keymap iTCO_vendor_support rfkill ppdev snd_hda_intel snd_hda_codec acpi_cpufreq snd_hwdep e1000e snd_seq snd_seq_device mperf coretemp kvm_intel snd_pcm snd_page_alloc snd_timer snd ptp kvm lpc_ich pps_core mfd_core soundcore microcode tpm_tis parport_pc parport wmi tpm_infineon tpm tpm_bios nfsd auth_rpcgss nfs_acl lockd sunrpc binfmt_misc uinput i915 i2c_algo_bit drm_kms_helper drm i2c_core video
[502486.658003] CPU 1 
[502486.658003] Pid: 1173, comm: vhost-1172 Not tainted 3.9.9-302.fc19.x86_64 #1 Hewlett-Packard HP Compaq dc5800 Small Form Factor/2820h
[502486.658003] RIP: 0010:[<ffffffff8113cada>]  [<ffffffff8113cada>] put_compound_page+0xaa/0x270
[502486.658003] RSP: 0018:ffff880164919c50  EFLAGS: 00010286
[502486.658003] RAX: ffff88004e46f800 RBX: ffff88004e46f600 RCX: 000000000000000f
[502486.658003] RDX: 0000000000000140 RSI: 0000000000000246 RDI: ffff88004e46f600
[502486.658003] RBP: ffff880164919c68 R08: 0000000000000304 R09: 0000000000000010
[502486.658003] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[502486.658003] R13: ffffffffa04befd4 R14: 000000000000000c R15: ffff880166a6c000
[502486.658003] FS:  0000000000000000(0000) GS:ffff88019bc80000(0000) knlGS:0000000000000000
[502486.658003] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[502486.658003] CR2: 000000000000001c CR3: 00000001649b8000 CR4: 00000000000027f0
[502486.658003] DR0: 0000000000000001 DR1: 0000000000000002 DR2: 0000000000000001
[502486.658003] DR3: 000000000000000a DR6: 00000000ffff0ff0 DR7: 0000000000000400
[502486.658003] Process vhost-1172 (pid: 1173, threadinfo ffff880164918000, task ffff88017ea24650)
[502486.658003] Stack:
[502486.658003]  ffff88004e46f600 ffff880166a6c000 ffffffffa04befd4 ffff880164919c80
[502486.658003]  ffffffff8113cceb 0000000000000012 ffff880164919ca0 ffffffff8152d0a7
[502486.658003]  ffff880166a6c000 ffff880166a6c000 ffff880164919cb8 ffffffff8152d13a
[502486.658003] Call Trace:
[502486.658003]  [<ffffffffa04befd4>] ? tun_get_user+0x724/0x810 [tun]
[502486.658003]  [<ffffffff8113cceb>] put_page+0x4b/0x60
[502486.658003]  [<ffffffff8152d0a7>] skb_release_data+0x87/0x100
[502486.658003]  [<ffffffff8152d13a>] __kfree_skb+0x1a/0xb0
[502486.658003]  [<ffffffff8152d202>] kfree_skb+0x32/0x90
[502486.658003]  [<ffffffffa04befd4>] tun_get_user+0x724/0x810 [tun]
[502486.658003]  [<ffffffffa04bf117>] tun_sendmsg+0x57/0x80 [tun]
[502486.658003]  [<ffffffffa050fa78>] handle_tx+0x1c8/0x640 [vhost_net]
[502486.658003]  [<ffffffffa050ff05>] handle_tx_net+0x15/0x20 [vhost_net]
[502486.658003]  [<ffffffffa050c81d>] vhost_worker+0xed/0x190 [vhost_net]
[502486.658003]  [<ffffffffa050c730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
[502486.658003]  [<ffffffff810802a0>] kthread+0xc0/0xd0
[502486.658003]  [<ffffffff810801e0>] ? insert_kthread_work+0x40/0x40
[502486.658003]  [<ffffffff8164f26c>] ret_from_fork+0x7c/0xb0
[502486.658003]  [<ffffffff810801e0>] ? insert_kthread_work+0x40/0x40
[502486.658003] Code: 48 c7 c7 27 08 9f 81 e8 55 02 f2 ff 48 89 df e8 1d f7 ff ff 85 c0 74 a4 eb 9a 4c 8b 67 30 48 8b 07 f6 c4 80 74 cd 4c 39 e7 74 c8 <41> 8b 54 24 1c 85 d2 74 bf 8d 4a 01 89 d0 f0 41 0f b1 4c 24 1c 
[502486.658003] RIP  [<ffffffff8113cada>] put_compound_page+0xaa/0x270
[502486.658003]  RSP <ffff880164919c50>
[502486.658003] CR2: 000000000000001c
[502486.789205] ---[ end trace 62bd766243ff78d0 ]---

Comment 57 Satish Balay 2013-07-16 18:58:03 UTC

(In reply to Satish Balay from comment #56)

> The machine and VMs are still useable - so the machine didn't crash.

I mis-spoke. One of the VMs [bsd with virtio for network/disk] is not responding.

Comment 58 Brad Hubbard 2013-07-16 21:27:18 UTC

(In reply to Satish Balay from comment #57)
> (In reply to Satish Balay from comment #56)
> 
> > The machine and VMs are still useable - so the machine didn't crash.
> 
> I mis-spoke. One of the VMs [bsd with virtio for network/disk] is not
> responding.

Yep, that fits the profile.

Comment 59 Izhar Firdaus 2013-07-17 02:49:56 UTC

happening with freebsd 9 guest here, repeatably when trying to run ezjail-admin update -b -p ..

Comment 60 Izhar Firdaus 2013-07-17 02:53:28 UTC

Created attachment 774591 [details]
/var/log/message of failure with Freebsd 9.1 guest with virtio blk/net , running ezjail-admin update -b -p

Comment 61 James Boyle 2013-07-18 15:53:41 UTC

Created attachment 775369 [details]
updated-patches-3.9.9

I updated to the 3.9.9 kernel (f18) and rebuilt with the attached.  (I applied the last hunk manually, just wanted to keep it to one attachment)

$ uname -a
Linux mnich.quotient-inc.com 3.9.9-201.bz975065.jwb.fc18.x86_64 #1 SMP Wed Jul 17 10:56:19 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

Jul 18 10:07:55 mnich kernel: [57896.472550] unable to build skb from iovec. zerocopy 1 offset 12 count 20 len 62834


So, nothing new, I just wanted to check on the newer kernel.  It looks the same.

Comment 62 jason wang 2013-07-19 02:47:09 UTC

Those should fix this issue:

http://patchwork.ozlabs.org/patch/259997/
http://patchwork.ozlabs.org/patch/259996/

Please test those with patches mentioned in comment #53

Thanks

Comment 63 James Boyle 2013-07-19 14:34:55 UTC

(In reply to jason wang from comment #62)
> Those should fix this issue:
> 
> http://patchwork.ozlabs.org/patch/259997/
> http://patchwork.ozlabs.org/patch/259996/
> 
> Please test those with patches mentioned in comment #53
> 
> Thanks

$ uname -a
Linux mnich.quotient-inc.com 3.9.9-201.bz975065.jwb.2.fc18.x86_64 #1 SMP Thu Jul 18 23:42:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux

Jul 19 09:44:27 mnich kernel: [ 1022.788596] unable to build skb from iovec. zerocopy 1 offset 12 count 19 len 62834
Jul 19 09:44:46 mnich kernel: [ 1041.745407] unable to build skb from iovec. zerocopy 1 offset 12 count 19 len 62834


I rebuilt with the patches indicated; The VirtIO interface took longer to seize this time, and I got two skb messages instead of one.

Comment 64 jason wang 2013-07-23 07:13:32 UTC

(In reply to James Boyle from comment #63)
> (In reply to jason wang from comment #62)
> > Those should fix this issue:
> > 
> > http://patchwork.ozlabs.org/patch/259997/
> > http://patchwork.ozlabs.org/patch/259996/
> > 
> > Please test those with patches mentioned in comment #53
> > 
> > Thanks
> 
> $ uname -a
> Linux mnich.quotient-inc.com 3.9.9-201.bz975065.jwb.2.fc18.x86_64 #1 SMP Thu
> Jul 18 23:42:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
> 
> Jul 19 09:44:27 mnich kernel: [ 1022.788596] unable to build skb from iovec.
> zerocopy 1 offset 12 count 19 len 62834
> Jul 19 09:44:46 mnich kernel: [ 1041.745407] unable to build skb from iovec.
> zerocopy 1 offset 12 count 19 len 62834
> 
> 
> I rebuilt with the patches indicated; The VirtIO interface took longer to
> seize this time, and I got two skb messages instead of one.

Thanks a lot for the testing. Does the network still usable after hitting this message.

btw, I didn't hit this error after about 2 hours testing.

Comment 65 Rik van Riel 2013-07-23 16:38:36 UTC

This patch looks like it might resolve the remaining issue:

http://patchwork.ozlabs.org/patch/252885/

Comment 66 James Boyle 2013-07-23 16:43:31 UTC

(In reply to jason wang from comment #64)
> (In reply to James Boyle from comment #63)
> > (In reply to jason wang from comment #62)
> > > Those should fix this issue:
> > > 
> > > http://patchwork.ozlabs.org/patch/259997/
> > > http://patchwork.ozlabs.org/patch/259996/
> > > 
> > > Please test those with patches mentioned in comment #53
> > > 
> > > Thanks
> > 
> > $ uname -a
> > Linux mnich.quotient-inc.com 3.9.9-201.bz975065.jwb.2.fc18.x86_64 #1 SMP Thu
> > Jul 18 23:42:21 EDT 2013 x86_64 x86_64 x86_64 GNU/Linux
> > 
> > Jul 19 09:44:27 mnich kernel: [ 1022.788596] unable to build skb from iovec.
> > zerocopy 1 offset 12 count 19 len 62834
> > Jul 19 09:44:46 mnich kernel: [ 1041.745407] unable to build skb from iovec.
> > zerocopy 1 offset 12 count 19 len 62834
> > 
> > 
> > I rebuilt with the patches indicated; The VirtIO interface took longer to
> > seize this time, and I got two skb messages instead of one.
> 
> Thanks a lot for the testing. Does the network still usable after hitting
> this message.
> 
> btw, I didn't hit this error after about 2 hours testing.

I'm going by memory at this point, but I think virtio networking worked briefly after the first skb message.  I'll test it again to confirm, with ping and added timestamps...

--James

Oh!  I was just about to submit the above when BZ showed Rik's message.  I'll try that patch as well.

Comment 67 Josh Boyer 2013-07-23 17:09:15 UTC

(In reply to Rik van Riel from comment #65)
> This patch looks like it might resolve the remaining issue:
> 
> http://patchwork.ozlabs.org/patch/252885/

F18:

[jwboyer@zod kernel]$ grep use-after kernel.spec
Patch25062: vhost-net-fix-use-after-free-in-vhost_net_flush.patch
ApplyPatch vhost-net-fix-use-after-free-in-vhost_net_flush.patch

F19:

[jwboyer@zod kernel]$ grep use-after kernel.spec
Patch25062: vhost-net-fix-use-after-free-in-vhost_net_flush.patch
ApplyPatch vhost-net-fix-use-after-free-in-vhost_net_flush.patch


We've been carrying that since Jul 5.

Comment 68 jason wang 2013-07-24 04:22:12 UTC

Hi Josh:

Please try this kernel which has more verbose debug info that may help analysis the issue. Thanks a lot.

http://koji.fedoraproject.org/koji/taskinfo?taskID=5643506

Comment 69 James Boyle 2013-07-25 15:34:12 UTC

(In reply to jason wang from comment #68)
> Hi Josh:
> 
> Please try this kernel which has more verbose debug info that may help
> analysis the issue. Thanks a lot.
> 
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5643506

I'm unable to reproduce the issue with the provided kernel (3.10.2-301.fc20.x86_64) & associated packages.  :^)

Thanks!
--James

Comment 70 Brad Hubbard 2013-07-25 21:14:28 UTC

(In reply to James Boyle from comment #69)
> I'm unable to reproduce the issue with the provided kernel
> (3.10.2-301.fc20.x86_64) & associated packages.  :^)
> 
> Thanks!
> --James

Thanks for all your testing James, it's appreciated.

Comment 71 James Boyle 2013-07-26 15:22:47 UTC

I'm glad that I could help :^)
--James

Comment 72 zbechir 2013-07-27 16:21:42 UTC

Description of problem:
Travailler avec Virt Manager en executant la commande mkfs sur une partition iscsi

Version-Release number of selected component:
kernel

Additional info:
reporter:       libreport-2.1.5
cmdline:        BOOT_IMAGE=/vmlinuz-3.9.9-302.fc19.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/swap rd.md=0 rd.dm=0 rd.lvm.lv=fedora/usr vconsole.keymap=fr rd.luks=0 rd.lvm.lv=fedora/root rhgb quiet LANG=fr_FR.utf8
kernel:         3.9.9-302.fc19.x86_64
runlevel:       N 5
type:           Kerneloops

Truncated backtrace:
BUG: unable to handle kernel NULL pointer dereference at 000000000000037f
IP: [<ffffffff8113ccaa>] put_page+0xa/0x60
PGD 0 
Oops: 0000 [#1] SMP 
Modules linked in: rfcomm fuse vhost_net macvtap macvlan ip6table_filter ip6_tables ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack tun bridge stp llc bnep bluetooth rfkill acpi_cpufreq mperf coretemp kvm_intel kvm snd_hda_codec_hdmi crc32_pclmul snd_hda_codec_realtek crc32c_intel ghash_clmulni_intel snd_hda_intel snd_hda_codec microcode iTCO_wdt iTCO_vendor_support snd_hwdep snd_seq snd_seq_device snd_pcm snd_page_alloc snd_timer gspca_vc032x gspca_main videodev media r8169 i2c_i801 serio_raw intel_ips snd mii soundcore lpc_ich mfd_core binfmt_misc uinput ata_generic pata_acpi i915 video i2c_algo_bit drm_kms_helper drm pata_jmicron i2c_core
CPU 3 
Pid: 1541, comm: vhost-1539 Not tainted 3.9.9-302.fc19.x86_64 #1 BIOSTAR Group TH55B HD/TH55B HD
RIP: 0010:[<ffffffff8113ccaa>]  [<ffffffff8113ccaa>] put_page+0xa/0x60
RSP: 0018:ffff8802124bbc78  EFLAGS: 00010202
RAX: ffff880186f33cc0 RBX: 0000000000000012 RCX: 0000000000000003
RDX: 0000000000000140 RSI: 0000000000000246 RDI: 000000000000037f
RBP: ffff8802124bbc80 R08: 0000000000000001 R09: 0000000000000010
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8802103dea00
R13: ffffffffa0398fd4 R14: 000000000000000c R15: ffff8802103dea00
FS:  0000000000000000(0000) GS:ffff88023fcc0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 000000000000037f CR3: 0000000206110000 CR4: 00000000000027e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process vhost-1539 (pid: 1541, threadinfo ffff8802124ba000, task ffff880233cc8000)
Stack:
 0000000000000012 ffff8802124bbca0 ffffffff8152d0a7 ffff8802103dea00
 ffff8802103dea00 ffff8802124bbcb8 ffffffff8152d13a ffff880212468800
 ffff8802124bbce0 ffffffff8152d202 ffff880212468800 000000000000c162
Call Trace:
 [<ffffffff8152d0a7>] skb_release_data+0x87/0x100
 [<ffffffff8152d13a>] __kfree_skb+0x1a/0xb0
 [<ffffffff8152d202>] kfree_skb+0x32/0x90
 [<ffffffffa0398fd4>] tun_get_user+0x724/0x810 [tun]
 [<ffffffffa0399117>] tun_sendmsg+0x57/0x80 [tun]
 [<ffffffffa0403a78>] handle_tx+0x1c8/0x640 [vhost_net]
 [<ffffffffa0403f25>] handle_tx_kick+0x15/0x20 [vhost_net]
 [<ffffffffa040081d>] vhost_worker+0xed/0x190 [vhost_net]
 [<ffffffffa0400730>] ? __vhost_add_used_n+0x100/0x100 [vhost_net]
 [<ffffffff810802a0>] kthread+0xc0/0xd0
 [<ffffffff810801e0>] ? insert_kthread_work+0x40/0x40
 [<ffffffff8164f26c>] ret_from_fork+0x7c/0xb0
 [<ffffffff810801e0>] ? insert_kthread_work+0x40/0x40
Code: 00 00 01 75 f3 e9 b6 fe ff ff be 47 01 00 00 48 c7 c7 47 53 9d 81 e8 76 00 f2 ff e9 a0 fe ff ff 90 66 66 66 66 90 55 48 89 e5 53 <48> f7 07 00 c0 00 00 48 89 fb 75 30 8b 47 1c 85 c0 74 16 f0 ff 
RIP  [<ffffffff8113ccaa>] put_page+0xa/0x60
 RSP <ffff8802124bbc78>
CR2: 000000000000037f

Comment 73 Richard Chan 2013-07-28 02:44:54 UTC

+1 for 3.10.3-300.fc19.x86_64: it has been behaving well.

I have used a Windows 7 64-bit VM with CIFS (to a share on the host) and internet access overnight; it has not triggered the oops. This VM used to trigger both this oops and the other infamous 980254.

Thank you very much to both fedora kernel team  and testers.

Comment 74 Stephen Murray 2013-07-29 23:19:39 UTC

Put 3.10.3-300.fc19.x86_64 onto two of my desktops with KVM, no problems so far after 24 hours of running. Thanks !

Comment 75 Josh Boyer 2013-08-16 14:02:11 UTC

We've not seen any recent reports of these issues.  I believe they are now fixed in all supported releases with the 3.10.y and 3.11-rcX kernels.

Is anyone still having this issue?

Comment 76 Stephen Murray 2013-08-16 14:04:32 UTC

No problems at all for me after almost three weeks of running.

Comment 77 James Boyle 2013-08-16 14:36:03 UTC

(In reply to Josh Boyer from comment #75)
> We've not seen any recent reports of these issues.  I believe they are now
> fixed in all supported releases with the 3.10.y and 3.11-rcX kernels.
> 
> Is anyone still having this issue?

I haven't seen any issues on 3.10.4-100 or the latest 3.10.6-100 for Fedora 18.  Everything looks OK!

Thanks again,
--James

Note You need to log in before you can comment on or make changes to this bug.

balay
bhubbard
brian.maddox
dkelson
drjones
gansalmon
ipilcher
itamar
jasowang
jonathan
kagesenshi.87
kernel-maint
madhu.chinakonda
melbournemark
michele
mst
murraysj
pcfe
rc556677
riel
Shurik
sjorna
smooge
stephen
yogeshsharma
yukawa
zbechir