Bug 927629 - virtio_net: WARNING: at lib/list_debug.c:29 __list_add+0x77/0xd0()
Summary: virtio_net: WARNING: at lib/list_debug.c:29 __list_add+0x77/0xd0()
Keywords:
Status: CLOSED INSUFFICIENT_DATA
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Unspecified
Priority: unspecified
Severity: unspecified
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard: abrt_hash:b64590cf8f922aa40db624e9f94...
Duplicates: 947070 953920 (view as bug list)
Depends On:
Blocks:
 
Reported: 2013-03-26 11:36 UTC by Eric Blake
Modified: 2013-09-30 11:35 UTC
CC List: 11 users

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-09-30 11:35:06 UTC
Type: ---
Embargoed:


Attachments
patch to suppress napi instance list modification (1.13 KB, patch)
2013-04-22 15:54 UTC, Neil Horman

Description Eric Blake 2013-03-26 11:36:34 UTC
Description of problem:
Running a VM, I used libvirt to send an ACPI power request, which attempted to put the VM into S3.

Additional info:
WARNING: at lib/list_debug.c:29 __list_add+0x77/0xd0()
Hardware name: Bochs
list_add corruption. next->prev should be prev (ffff880036950060), but was           (null). (next=ffff880036f77450).
Modules linked in: fuse ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bnep bluetooth rfkill ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi microcode virtio_balloon virtio_net i2c_piix4 i2c_core uinput virtio_blk
Pid: 8611, comm: kworker/u:4 Not tainted 3.8.4-202.fc18.x86_64 #1
Call Trace:
 [<ffffffff8105e62f>] warn_slowpath_common+0x7f/0xc0
 [<ffffffff8105e726>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff8130a9a7>] __list_add+0x77/0xd0
 [<ffffffffa00318f5>] ? init_vqs+0x55/0x490 [virtio_net]
 [<ffffffff81542fc8>] netif_napi_add+0x48/0x70
 [<ffffffffa0031994>] init_vqs+0xf4/0x490 [virtio_net]
 [<ffffffffa0031d4f>] virtnet_restore+0x1f/0x100 [virtio_net]
 [<ffffffff8139729e>] virtio_pci_restore+0x7e/0xb0
 [<ffffffff813244a3>] pci_pm_resume+0x73/0xd0
 [<ffffffff81324430>] ? pci_pm_restore+0xd0/0xd0
 [<ffffffff813f2b08>] dpm_run_callback+0x58/0x90
 [<ffffffff813f34de>] device_resume+0xde/0x200
 [<ffffffff813f3621>] async_resume+0x21/0x50
 [<ffffffff81089980>] async_run_entry_fn+0xb0/0x1b0
 [<ffffffff8107a693>] process_one_work+0x163/0x490
 [<ffffffff8107ceee>] worker_thread+0x15e/0x450
 [<ffffffff8107cd90>] ? busy_worker_rebind_fn+0x110/0x110
 [<ffffffff81081fb0>] kthread+0xc0/0xd0
 [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_entry+0xa0/0xf0
 [<ffffffff81081ef0>] ? kthread_create_on_node+0x120/0x120
 [<ffffffff8165882c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81081ef0>] ? kthread_create_on_node+0x120/0x120

Comment 1 Dave Jones 2013-04-08 22:44:32 UTC
*** Bug 947070 has been marked as a duplicate of this bug. ***

Comment 2 Neil Horman 2013-04-19 20:26:25 UTC
Ugh, they're deleting and re-adding napi instances while the device is registered, and the protection on that list is handled with flags and enabling/disabling of the softirq context.  That's wrong.  I'll write a patch for this on Monday.  We shouldn't have to add/delete napi instances like this at all.

Comment 3 Neil Horman 2013-04-22 15:54:57 UTC
Created attachment 738573 [details]
patch to suppress napi instance list modification

This patch should fix your issue.  free_netdev will handle the cleanup of napi instances for virtio_net, so all we have to do is allocate them and make sure they don't get messed with in the tx/rx paths in parallel with the suspend/resume path.  I've not tested this patch, but if you could test it and confirm that it solves your problem, I would appreciate it.  Thanks!

Comment 4 Dave Jones 2013-04-22 18:34:18 UTC
*** Bug 953920 has been marked as a duplicate of this bug. ***

Comment 5 Neil Horman 2013-04-30 17:57:19 UTC
ping, Eric, any feedback here?

Comment 6 Eric Blake 2013-04-30 18:24:08 UTC
Is your patch available in an already-built kernel?  I'm assuming that all I would have to do to test it is load a fixed kernel into my VM, then attempt the same action that triggered the failure the first time?

Comment 7 Eric Blake 2013-04-30 18:29:35 UTC
(In reply to comment #3)
> Created attachment 738573 [details]
> patch to suppress napi instance list modification

I've never built my own kernel before - it would save me some effort if there was a scratch build somewhere with this patch already applied that I could test.

Comment 8 Neil Horman 2013-04-30 20:40:11 UTC
I figured, coming from an RH address, you would be able to build your own.  Sorry, here you go:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5319015

Comment 9 Eric Blake 2013-04-30 21:06:30 UTC
(In reply to comment #8)
> I figured, coming from an RH address, you would be able to build your own.

Capable: probably, given enough time.  Interested: Definitely - I'm not opposed to the idea of learning how to build my own kernel.  But enough time available: there's the rub.  I doubt that I would manage to learn everything needed to churn out my first kernel without taking up valuable time from my other tasks.  Not every RH address represents a kernel developer :)

> Sorry, here you go:
> http://koji.fedoraproject.org/koji/taskinfo?taskID=5319015

Thanks!  Testing now...

Comment 10 Eric Blake 2013-05-01 03:13:08 UTC
I downloaded kernel-3.8.10-200.bz927629.fc18.x86_64 into my VM and used yum localinstall to set it up.  Everything seemed to boot, but whereas an S3 suspend in kernel-3.8.9-200.fc18.x86_64 would resume after a keystroke, an S3 suspend in the scratch build remained on a black screen even after the same steps that woke up the released kernel version.  I'm not sure what else differs between the released 3.8.9-200 and your 3.8.10-200.bz927629, so I can't tell whether your patch is at fault or something else, nor whether your patch fixed anything, but I do know that your build didn't resume properly.  Let me know if there's anything else I need to do to help you.
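
For reference, the install-and-test cycle being described is roughly the following (a sketch, not the verbatim commands; the task ID is the one from comment 8, and the exact RPM file name may differ - if the koji CLI's download-task subcommand isn't available, fetch the RPM from the task URL instead):

  # on the guest: fetch the scratch-build kernel from the koji task
  koji download-task 5319015 --arch=x86_64
  sudo yum localinstall kernel-3.8.10-200.bz927629.fc18.x86_64.rpm
  sudo reboot
  # then repeat the S3 suspend/resume that produced the original warning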

Comment 11 Neil Horman 2013-05-01 12:51:39 UTC
Yes, you can go get a 3.8.10-200 kernel from brew and see if the same problem occurs.  There's nothing in the patch I provided you that should cause resume to stop.  I expect you've hit a different problem.

Comment 12 Josh Boyer 2013-05-01 13:01:41 UTC
(In reply to comment #11)
> Yes, you can go get a 3.8.10-200 kernel from brew and see if the same
> problem occurs.  There's nothing in the patch I provided you that should
> cause resume to stop.  I expect you've hit a different problem.

Note: Neil means koji, not brew.

Comment 13 Neil Horman 2013-05-01 13:11:42 UTC
Yes, thank you, Josh.  You'll not yet find 3.8.10 in brew, only in koji.

Comment 14 Eric Blake 2013-05-02 02:18:55 UTC
I tried 3.8.10-200 from koji, and it was able to recover from S3 suspend, although it changed the monitor resolution (dropping from 1024x768 down to 640x480) on resume, and also opened up an ABRT window that still mentioned a kernel oops in __list_add.

Comment 15 Eric Blake 2013-05-02 02:38:58 UTC
In the 3.8.10-200.bz927629 build, the VM does transition out of S3 back into "running", at least according to qemu in the host, but is stuck at 100% cpu.  Unfortunately, I don't know enough about debugging qemu or kernels to know how to get a useful trace of where the kernel seems to be stuck.

Comment 16 Neil Horman 2013-05-02 13:50:42 UTC
Do you get any output on the console during the transition out of S3?  If so, I can instrument the kernel with printks to get a general idea of where we are spending time.

Comment 17 Eric Blake 2013-05-02 14:02:50 UTC
Nothing.  I used 'virsh dompmwakeup guest' from the host to trigger the wakeup, and the only change is that qemu switches from reporting suspended and near 0% cpu over to live and near 100% cpu, but no change to the console - it went black when entering S3, and never changes from black on resume.

Comment 18 Neil Horman 2013-05-02 14:08:23 UTC
ok, well the only other action I can think to take is to apply my patch to a 3.8.4-202 kernel (which I think otherwise came out of suspend), and see if it fixes your list_add problem.  If it does, I can move forward with this fix, and we can address the cpu churn separately.  I'll spin another kernel shortly.

Comment 19 Neil Horman 2013-05-02 15:23:07 UTC
http://koji.fedoraproject.org/koji/taskinfo?taskID=5325167

Here you go.  Older kernel to test with my patch.  Just FYI, this kernel doesn't include it, but commit 0d7b2212f7b2add8f38aa06debc462749aded700 looks like it might have some relevance to your suspend issue.

Comment 20 Neil Horman 2013-05-08 20:05:09 UTC
Ping, Eric, any test feedback on the kernel in comment 19?

Comment 21 Eric Blake 2013-05-08 21:22:49 UTC
Just tested the build in comment 19 (3.8.4-202.927629), and it has the same symptoms - when the host tells qemu to wake up the guest, the guest transitions into a running state, but consumes 100% cpu and leaves the screen blank (same as comment 17).

I also tested the latest upstream (3.8.11-200), which still has the same symptoms as other unpatched kernels (same as comment 14).

Comment 22 Eric Blake 2013-05-08 22:44:39 UTC
Bug 961143 mentions that the resolution glitches on resume may be a spice bug; things may behave better once that is fixed...

Comment 23 Neil Horman 2013-05-09 13:30:15 UTC
What's the process for suspending a virtual machine?  Is hitting the pause button in virt-manager sufficient?  If you can detail your reproducer for me, I'll try to get it set up on my system so that I can tinker with it further.

Comment 24 Eric Blake 2013-05-09 14:27:40 UTC
There are several ways to play with S3 in a guest; my setup involved F18 installed as both host and guest.  Basically you want anything that will trigger the guest to see an ACPI request to go into S3, whether triggered from the guest or the host.  I've tested that all of the methods mentioned below for both suspend and for wakeup have the same net result (that is, the cause of the trigger doesn't matter, as long as the S3 event gets triggered).

Simplest is probably triggering S3 from the guest: I log in to the guest, ensure 'pm-utils' is installed, and issue 'pm-suspend' as root.  If you have a GNOME session up, you can also click your name on the right, hold the ALT key and click SUSPEND on that menu.

You can also trigger a suspend from the host, although that requires wiring up qemu-guest-agent on both host (changing the domain XML to add a channel to the agent) and the guest (F18 doesn't install qemu-guest-agent by default).  With that in place, you can do 'virsh dompmsuspend $guest --target=memory' from the host.  If you want to try this approach instead of suspending from within the guest, and need more help, let me know.

Once suspended, qemu lets you resume a guest if you send a keypress to the guest's window, or you can use 'virsh dompmwakeup $guest'.

You can check from the host the current state of qemu running the guest; 'virsh list' will show either "running" or "pmsuspended".  I also use virt-manager, which gives a nice graphical indication of whether the guest appears to be churning through 100% cpu.
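
Condensed into commands, the reproducer above looks roughly like this ($guest is a placeholder for the libvirt domain name; any one suspend trigger paired with any one wakeup trigger gives the same result):

  # inside the guest (simplest trigger; requires pm-utils):
  pm-suspend

  # or from the host, with qemu-guest-agent wired up in host and guest:
  virsh dompmsuspend $guest --target=memory

  # check state, then wake the guest from the host:
  virsh list                 # shows "pmsuspended" while asleep
  virsh dompmwakeup $guest
  virsh list                 # shows "running" again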

Comment 25 Neil Horman 2013-05-09 15:26:42 UTC
Thanks, Eric, I'll try to get to this shortly.

Comment 26 Neil Horman 2013-06-04 16:07:56 UTC
hmm, sorry for the delay, I finally got back to working on this.  I just tried to reproduce, and on both my VM guest and my host system, running the 3.9.4-200.fc18 kernel, I'm unable to make this problem recur.  I can use pm-utils to suspend the VM, and when I take it out of suspend, it comes right back, no warnings recorded, no cpu spikes.

Starting to wonder if maybe commit 008d4278072216bd2459a6e41b07b688fe95ee83 fixed at least part of this.

Comment 27 Neil Horman 2013-06-13 19:10:50 UTC
ping, any response here?  I figure I'll close this as worksforme if you can't reproduce on the latest kernel.

Comment 28 Adam Williamson 2013-06-26 08:37:14 UTC
Description of problem:
Doing KDE desktop tests on F19 Final RC2, popped up after running lots of apps and switching user accounts a few times.

Version-Release number of selected component:
kernel

Additional info:
reporter:       libreport-2.1.5
cmdline:        BOOT_IMAGE=/vmlinuz-3.9.5-301.fc19.x86_64 root=/dev/mapper/fedora-root ro rd.lvm.lv=fedora/swap rd.md=0 rd.dm=0 rd.luks=0 vconsole.font=latarcyrheb-sun16 rd.lvm.lv=fedora/root vconsole.keymap=uk rhgb quiet
kernel:         3.9.5-301.fc19.x86_64
runlevel:       N 5
type:           Kerneloops

Truncated backtrace:
WARNING: at lib/list_debug.c:29 __list_add+0x65/0xc0()
Hardware name: Bochs
list_add corruption. next->prev should be prev (ffff88007b276060), but was           (null). (next=ffff88007bd9f050).
Modules linked in: bnep bluetooth rfkill nf_conntrack_netbios_ns nf_conntrack_broadcast ipt_MASQUERADE ip6table_nat nf_nat_ipv6 ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ebtable_filter ebtables ip6table_filter ip6_tables snd_hda_intel snd_hda_codec snd_hwdep snd_seq snd_seq_device snd_pcm microcode virtio_balloon virtio_net snd_page_alloc snd_timer snd soundcore i2c_piix4 uinput qxl drm_kms_helper virtio_blk sym53c8xx scsi_transport_spi ttm drm i2c_core
Pid: 36, comm: kworker/u:1 Not tainted 3.9.5-301.fc19.x86_64 #1
Call Trace:
 [<ffffffff81306a00>] ? btree_merge+0x70/0x100
 [<ffffffff8105cc26>] warn_slowpath_common+0x66/0x80
 [<ffffffff8105cc8c>] warn_slowpath_fmt+0x4c/0x50
 [<ffffffff8108ef06>] ? __cond_resched+0x26/0x30
 [<ffffffff81306af5>] __list_add+0x65/0xc0
 [<ffffffff815361e8>] netif_napi_add+0x48/0x70
 [<ffffffffa00e899c>] init_vqs+0xfc/0x480 [virtio_net]
 [<ffffffff8151fcf4>] ? raw_pci_write+0x24/0x50
 [<ffffffff813165aa>] ? pci_bus_write_config_byte+0x5a/0x70
 [<ffffffffa00e8d3d>] virtnet_restore+0x1d/0xf0 [virtio_net]
 [<ffffffff813920b6>] virtio_pci_restore+0x66/0xa0
 [<ffffffff813200b4>] pci_pm_resume+0x64/0xb0
 [<ffffffff81320050>] ? pci_pm_thaw+0x90/0x90
 [<ffffffff813ed414>] dpm_run_callback+0x44/0x90
 [<ffffffff813ed56e>] device_resume+0xce/0x1f0
 [<ffffffff813ed6ad>] async_resume+0x1d/0x50
 [<ffffffff81086b79>] async_run_entry_fn+0x39/0x120
 [<ffffffff810798ce>] process_one_work+0x16e/0x3f0
 [<ffffffff8107b34f>] worker_thread+0x10f/0x3d0
 [<ffffffff8107b240>] ? manage_workers+0x340/0x340
 [<ffffffff810801f0>] kthread+0xc0/0xd0
 [<ffffffff81080130>] ? insert_kthread_work+0x40/0x40
 [<ffffffff8164e6ec>] ret_from_fork+0x7c/0xb0
 [<ffffffff81080130>] ? insert_kthread_work+0x40/0x40

Comment 29 Neil Horman 2013-06-26 14:33:24 UTC
Adam, that kernel doesn't have the patch I included here, can you try it with my patch?

Comment 30 Adam Williamson 2013-06-26 14:46:06 UTC
Neil: unfortunately I don't have a clean reproducer; the above report came after a couple of hours of launching every app listed on the KDE menus...

Comment 31 Adam Williamson 2013-06-28 18:09:44 UTC
Is https://bugzilla.redhat.com/show_bug.cgi?id=979529 this again, or something else?

Comment 32 Neil Horman 2013-06-28 18:44:23 UTC
Nope, that's something else.  Looks like it's a failure in the nouveau graphics driver.

Comment 33 Eric Blake 2013-06-28 19:38:01 UTC
(In reply to Neil Horman from comment #27)
> ping, any response here?  I figure I'll close this as worksforme if you can't
> reproduce on the latest kernel.

Sorry, I was on vacation most of June.  Let me fire up my VM and repeat the test with the latest kernel...

Comment 34 Eric Blake 2013-06-28 21:29:56 UTC
Tested with kernel-3.9.6-200.fc18.x86_64 in the guest (latest F18 stable), and ABRT still pops up a box stating a problem has been detected in the kernel when the guest resumes, even though the guest appears to function fine after coming out of S3.

Comment 35 Neil Horman 2013-06-30 23:27:34 UTC
Eric, again, as in comment 32, the latest kernel doesn't have my fix in place.  I'm asking you all to test a kernel with my patch in place, i.e. the kernel I built in comment 19.  Can you please do that instead of telling me that the problem still exists in a kernel that hasn't been patched?

Comment 36 Adam Williamson 2013-07-01 05:40:48 UTC
Neil: comments #26 and #27 suggested that you thought it had been fixed outside of that patch. You asked Eric to try again with a recent kernel and set 'needinfo?'. He just responded to that needinfo request. It helps if you don't get confused about what you're asking people. :)

Comment 37 Neil Horman 2013-07-01 12:39:08 UTC
Adam, Eric, I'm sorry, I'm working on two virtio_net problems with similar descriptions, and I thought this was the other one.  You're absolutely right, I was confused about what I was asking from whom.

That raises the question, though: what's going on here?  Eric, you say that it comes back from S3 ok, but ABRT still pops up a message about a problem.  Can you post the backtrace here?  Let's see if it's the same thing that the initial description indicated, or something different.

Comment 38 Neil Horman 2013-07-12 13:54:18 UTC
ping, Eric?

Comment 39 Neil Horman 2013-07-22 17:05:39 UTC
ping, if there's no response on this soon, I'll close it.  If you want to work on it again at a later date, please reopen.

Comment 40 Eric Blake 2013-07-22 20:10:07 UTC
(In reply to Neil Horman from comment #39)
> ping, if there's no response on this soon, I'll close it.  If you want to
> work on it again at a later date, please reopen.

I would love to get this issue resolved.  What kernel do you want me to test, and how do I get a backtrace in order to post it here?

Comment 41 Neil Horman 2013-07-22 20:25:52 UTC
I want what I asked for from Adam in comment 37: a post of the backtrace that he got when he encountered that error, so I can make sure it's the same problem we're dealing with.

Comment 42 Eric Blake 2013-07-22 20:29:11 UTC
Which kernel do you want me to test, and how do I collect the backtrace?

Comment 43 Adam Williamson 2013-07-22 22:32:19 UTC
Neil: whatever you need you'll want to get from Eric; as I said, I don't have any clean reproducer for this. I'm just trying to act as conversational WD40...

Comment 44 Neil Horman 2013-07-23 13:24:02 UTC
Well, everything I've built previously has long since expired.  The latest kernel built along with my patch from comment 3 would be good.  Here's a new build:
http://koji.fedoraproject.org/koji/taskinfo?taskID=5644219

You can collect the backtrace via any method you like.  Some people use ABRT; others connect a virtual serial port to their guest and redirect console output that way to capture it to a file.
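
For the serial-port route, a minimal sketch (assuming the guest already has a pty-backed serial device defined in its domain XML; the kernel arguments are the ones Eric arrives at later in comment 55):

  # on the guest's kernel command line in grub:
  #   console=tty0 console=ttyS0
  # then, from the host, attach to the serial port and log to a file:
  virsh console $guest | tee console.log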

Comment 45 Eric Blake 2013-07-23 15:31:21 UTC
Give me a moment - that scratch build is an F19 kernel, but my vm was still running F18, so I'm in the middle of upgrading it in order to test your scratch build.

Comment 46 Josh Boyer 2013-07-23 15:38:05 UTC
(In reply to Eric Blake from comment #45)
> Give me a moment - that scratch build is an F19 kernel, but my vm was still
> running F18, so I'm in the middle of upgrading it in order to test your
> scratch build.

You didn't need to do that.  F19 kernels will work fine on F18 (and vice versa).

Comment 47 Eric Blake 2013-07-23 15:40:39 UTC
Good to know. Thankfully, I took a snapshot, so it was quick work to revert to that snapshot, and now I'll try booting the f19 kernel on top of the f18 install.

Comment 48 Eric Blake 2013-07-23 16:03:42 UTC
No joy: kernel-3.10.2-301.fc19.x86_64 is not giving me any graphics at boot.  I can't even get it to the point of letting me use a tty or attempt an S3 suspend.

Comment 49 Neil Horman 2013-07-23 17:53:38 UTC
Eric, so attach a virtual serial console, set your command line to direct console output there, and see where you get before you crash.  Although it sounds like this problem has nothing to do with the crash you encountered just now.

If you need to, I imagine Adam has a copy of the F18 kernel I built previously.

Comment 50 Eric Blake 2013-07-23 18:50:50 UTC
(In reply to Neil Horman from comment #49)
> Eric, so attach a virtual serial console, set your command line to direct
> console output there, and see where you get before you crash.  Although it
> sounds like this problem has nothing to do with the crash you encountered
> just now.

Indeed, the inability to boot with graphics seems unrelated to the S3 issue, unless your changes are what is making the graphics unavailable.  I'm not an expert on setting up direct console access to a VM, so please bear with me.  I'm struggling to get you the information you want, but I really do want to help get this bug squashed.

> 
> If you need to, I imagine Adam has a copy of the f18 kernel I built
> previously

I still have access to the kernel mentioned in comment 19 (3.8.4-202.927629.fc18) as well as the one mentioned in comment 8 (3.8.10-200.bz927629.fc18), if you need me to roll back to either of those versions and try to grab a stack trace.

Comment 51 Eric Blake 2013-07-23 19:14:48 UTC
If it helps, I just tested the latest stable F18 kernel: 3.9.10-200.fc18.x86_64.  On resume from S3, it caused the following backtrace to be logged in ABRT:

WARNING: at lib/list_debug.c:29 __list_add+0x77/0xd0()
Hardware name: Bochs
list_add corruption. next->prev should be prev (ffff880036e0c060), but was           (null). (next=ffff88003b9b6850).
Modules linked in: fuse ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bnep bluetooth rfkill ebtable_filter ebtables ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi microcode virtio_balloon virtio_net i2c_piix4 i2c_core uinput virtio_blk
Pid: 39, comm: kworker/u:2 Not tainted 3.9.10-200.fc18.x86_64 #1
Call Trace:
 [<ffffffff8105efc5>] warn_slowpath_common+0x75/0xa0
 [<ffffffff8105f0a6>] warn_slowpath_fmt+0x46/0x50
 [<ffffffff813145f7>] __list_add+0x77/0xd0
 [<ffffffffa00309e5>] ? init_vqs+0x55/0x490 [virtio_net]
 [<ffffffff8154f448>] netif_napi_add+0x48/0x70
 [<ffffffffa0030a8c>] init_vqs+0xfc/0x490 [virtio_net]
 [<ffffffff81324309>] ? pci_bus_write_config_byte+0x69/0x90
 [<ffffffffa0030e3f>] virtnet_restore+0x1f/0x100 [virtio_net]
 [<ffffffff813a1efe>] virtio_pci_restore+0x7e/0xb0
 [<ffffffff8132e093>] pci_pm_resume+0x73/0xd0
 [<ffffffff8132e020>] ? pci_pm_restore+0xd0/0xd0
 [<ffffffff813fee98>] dpm_run_callback+0x58/0x90
 [<ffffffff813ff86e>] device_resume+0xde/0x200
 [<ffffffff813ff9b1>] async_resume+0x21/0x50
 [<ffffffff81089886>] async_run_entry_fn+0x46/0x140
 [<ffffffff8107b6d3>] process_one_work+0x173/0x3c0
 [<ffffffff8107cfff>] worker_thread+0x10f/0x390
 [<ffffffff8107cef0>] ? busy_worker_rebind_fn+0xb0/0xb0
 [<ffffffff81082b70>] kthread+0xc0/0xd0
 [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_flush+0x20/0xb0
 [<ffffffff81082ab0>] ? kthread_create_on_node+0x120/0x120
 [<ffffffff81667f6c>] ret_from_fork+0x7c/0xb0
 [<ffffffff81082ab0>] ? kthread_create_on_node+0x120/0x120

Comment 52 Neil Horman 2013-07-23 19:32:18 UTC
Thank you, Eric, but unfortunately that doesn't help: we already know you can reproduce the bug without my patch.  What we need to know is whether you can reproduce the bug with my patch, and whether it results in the same backtrace.

Comment 53 Eric Blake 2013-07-23 19:37:47 UTC
The kernel in comment 19 (3.8.4-202.927629.fc18) goes into a 100% CPU spin after resuming from S3, so there is no way that I know of to get a backtrace from that kernel.  I'm trying again with the kernel in comment 44 (3.10.2-301.fc19.x86_64) with 'rhgb quiet' nuked from the grub2 config line; while booting, it is able to show lots of 'Started ...' messages, and the display doesn't disappear until the point in the boot process where it normally switches to graphics mode to display gdm.  I'm still trying to figure out how to hook up console access to force the kernel to send output to the virtual serial port.

Comment 54 Adam Williamson 2013-07-23 19:40:35 UTC
Why don't you just boot the fc19 kernel in runlevel 3 and use 'pm-suspend' to suspend it?

Comment 55 Eric Blake 2013-07-23 20:14:01 UTC
I figured out how to get the kernel to output to a serial console (basically, modify the grub line to add 'console=tty0 console=ttyS0'), but 3.10.2-301.fc19 is still not happy.  This is the last output it gives me:

[*     ] A start job is running for Wait for Plymouth Boot Screen to Quit[   19.466044] Adjusting kvm-clock more than 11% (9371652 vs 9311354)
[FAILED] Failed to start Wait for Plymouth Boot Screen to Quit.
See 'systemctl status plymouth-quit-wait.service' for details.
         Starting Serial Getty on ttyS0...
[  OK  ] Started Serial Getty on ttyS0.
[  OK  ] Reached target Login Prompts.
[  OK  ] Reached target Multi-User System.
[  OK  ] Reached target Graphical Interface.

Fedora release 18 (Spherical Cow)
Kernel 3.10.2-301.fc19.x86_64 on an x86_64 (ttyS0)

localhost login: root
Password: 
Login incorrect

The login: prompt accepts my typing fine, but every time the Password: prompt comes up, typing any character (such as spacebar) behaves as if I had hit Enter with an empty password, which of course is incorrect.  Thus, I can't log in, even on the serial console, to trigger an S3 from inside the guest.

Comment 56 Eric Blake 2013-07-23 20:21:38 UTC
(In reply to Adam Williamson from comment #54)
> Why don't you just boot the fc19 kernel in runlevel 3 and use 'pm-suspend'
> to suspend it?

Yay - adding just '3' (and not the previous attempt that added 'console=ttyS0') got me further: I was actually able to log in via the graphical console.  Now trying the suspend...

pm-suspend took the machine into S3, then hitting Enter brought it out (virsh list says the status went from pmsuspended back to running).  The guest is not using 100% cpu, but it appears to be stuck; it is not giving me a prompt to type anything.  The display is stuck at:

[root@localhost ~]# pm-suspend
[  102.399744] PM: Synching filesystems ... done.

with no further indication of what happened after resuming.

Comment 57 Eric Blake 2013-07-23 21:09:33 UTC
In comparison, booting into runlevel 3 on 3.9.10-200.fc18 and issuing pm-suspend from there, followed by an Enter to wake it back up, never restored any screen contents; even though the guest wasn't at 100% cpu, it was unusable (unlike at the default runlevel, where it was usable after ABRT reported that a crash had been detected).

Comment 58 Eric Blake 2013-07-23 22:08:37 UTC
Progress: I booted 3.10.2-301.fc19.x86_64 into runlevel 3, ran pm-suspend, and on resume, I got the following on the serial console:

[   18.723337] PM: Syncing filesystems ... done.
[   18.724855] Freezing user space processes ... (elapsed 0.01 seconds) done.
[   18.737128] Freezing remaining freezable tasks ... (elapsed 0.01 seconds) done.
[   18.749873] Suspending console(s) (use no_console_suspend to debug)
[   18.851881] PM: suspend of devices complete after 101.158 msecs
[   18.851975] PM: late suspend of devices complete after 0.096 msecs
[   18.854532] PM: noirq suspend of devices complete after 2.515 msecs
[   18.854550] ACPI: Preparing to enter system sleep state S3
[   18.854618] PM: Saving platform NVS memory
[   18.854619] Disabling non-boot CPUs ...
[   18.855111] kvm-clock: cpu 0, msr 0:3e7e8001, primary cpu clock, resume
[   18.855111] ACPI: Low-level resume complete
[   18.855111] ACPI Exception: AE_NOT_FOUND, While evaluating Sleep State [\_S0_] (20130328/hwxface-568)
[   18.855111] PM: Restoring platform NVS memory
[   18.855111] ACPI: Waking up from system sleep state S3
[   18.864814] PM: noirq resume of devices complete after 9.283 msecs
[   18.864931] PM: early resume of devices complete after 0.066 msecs
[   18.865108] pci 0000:00:01.0: PIIX3: Enabling Passive Release
[   18.866310] usb usb1: root hub lost power or was reset
[   19.319589] usb 1-1: reset full-speed USB device number 2 using uhci_hcd
[   19.465998] PM: resume of devices complete after 601.041 msecs
[   19.498266] Restarting tasks ... done.
[   19.676842] BUG: unable to handle kernel paging request at 0000000036560020
[   19.677087] IP: [<ffffffffa00a0da0>] try_fill_recv+0x30/0x530 [virtio_net]
[   19.677087] PGD 0 
[   19.677087] Oops: 0000 [#1] SMP 
[   19.677087] Modules linked in: ebtable_nat ipt_MASQUERADE nf_conntrack_netbios_ns nf_conntrack_broadcast ip6table_mangle ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 iptable_nat nf_nat_ipv4 nf_nat iptable_mangle nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack bnep bluetooth rfkill ebtable_filter ebtables be2iscsi iscsi_boot_sysfs ip6table_filter bnx2i cnic uio ip6_tables cxgb4i cxgb4 cxgb3i cxgb3 mdio libcxgbi ib_iser rdma_cm ib_addr iw_cm ib_cm ib_sa ib_mad ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi microcode virtio_balloon virtio_net i2c_piix4 uinput qxl drm_kms_helper ttm virtio_blk drm i2c_core
[   19.677087] CPU: 0 PID: 499 Comm: NetworkManager Not tainted 3.10.2-301.fc19.x86_64 #1
[   19.677087] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
[   19.677087] task: ffff880036186dc0 ti: ffff88003bb3a000 task.ti: ffff88003bb3a000
[   19.677087] RIP: 0010:[<ffffffffa00a0da0>]  [<ffffffffa00a0da0>] try_fill_recv+0x30/0x530 [virtio_net]
[   19.677087] RSP: 0018:ffff88003bb3b718  EFLAGS: 00010286
[   19.677087] RAX: 0000000036560000 RBX: 0000000000000000 RCX: 0000000000000052
[   19.677087] RDX: ffff880036013b50 RSI: 00000000000000d0 RDI: ffff880038d2e000
[   19.677087] RBP: ffff88003bb3b770 R08: 0000000000000000 R09: 00000000ffffffa1
[   19.677087] R10: ffff8800364d9000 R11: 0000000000000000 R12: ffff880036562000
[   19.677087] R13: ffff880038d2e090 R14: 000077ff80000000 R15: ffff880038d2e000
[   19.677087] FS:  00007f10d3273840(0000) GS:ffff88003e400000(0000) knlGS:0000000000000000
[   19.677087] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   19.677087] CR2: 0000000036560020 CR3: 000000003b480000 CR4: 00000000000006f0
[   19.677087] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   19.677087] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[   19.677087] Stack:
[   19.677087]  00000000ffffffed 0000000000000000 000000000000000d ffff880036562000
[   19.677087]  ffff88003bb3b770 000000d08164c84c 0000000000000000 ffff880036562000
[   19.677087]  0000000000000000 ffff880036562840 0000000000000000 ffff88003bb3b7a0
[   19.677087] Call Trace:
[   19.677087]  [<ffffffffa00a1b9c>] virtnet_open+0x7c/0xb0 [virtio_net]
[   19.677087]  [<ffffffff8153e80f>] __dev_open+0xbf/0x140
[   19.677087]  [<ffffffff8153eac2>] __dev_change_flags+0x92/0x170
[   19.677087]  [<ffffffff8153ec3d>] dev_change_flags+0x1d/0x60
[   19.677087]  [<ffffffff8154c049>] do_setlink+0x339/0xa00
[   19.677087]  [<ffffffff8112dc3d>] ? find_get_page+0x2d/0x100
[   19.994115]  [<ffffffff81305202>] ? nla_parse+0x32/0xe0
[   19.994115]  [<ffffffff8154d3d4>] rtnl_newlink+0x394/0x5e0
[   19.994115]  [<ffffffff8128a1be>] ? selinux_capable+0x2e/0x40
[   19.994115]  [<ffffffff81549f29>] rtnetlink_rcv_msg+0x99/0x260
[   19.994115]  [<ffffffff81287285>] ? sock_has_perm+0x75/0x90
[   19.994115]  [<ffffffff81549e90>] ? rtnetlink_rcv+0x30/0x30
[   19.994115]  [<ffffffff81568509>] netlink_rcv_skb+0xa9/0xc0
[   19.994115]  [<ffffffff81549e88>] rtnetlink_rcv+0x28/0x30
[   19.994115]  [<ffffffff81567bdd>] netlink_unicast+0xdd/0x190
[   19.994115]  [<ffffffff81567f56>] netlink_sendmsg+0x2c6/0x6c0
[   19.994115]  [<ffffffff81564e02>] ? netlink_seq_next+0x92/0xf0
[   19.994115]  [<ffffffff81566014>] ? netlink_recvmsg+0x204/0x390
[   19.994115]  [<ffffffff815255d9>] sock_sendmsg+0x99/0xd0
[   19.994115]  [<ffffffff81525c78>] ? sock_recvmsg+0xa8/0xe0
[   19.994115]  [<ffffffff815df317>] ? unix_dgram_sendmsg+0x547/0x600
[   19.994115]  [<ffffffff815259fe>] ___sys_sendmsg+0x39e/0x3b0
[   19.994115]  [<ffffffff81525b8e>] ? SYSC_sendto+0x17e/0x1c0
[   19.994115]  [<ffffffff815267e2>] __sys_sendmsg+0x42/0x80
[   19.994115]  [<ffffffff81526832>] SyS_sendmsg+0x12/0x20
[   19.994115]  [<ffffffff81650e99>] system_call_fastpath+0x16/0x1b
[   19.994115] Code: 55 48 89 e5 41 57 49 89 ff 41 56 49 be 00 00 00 80 ff 77 00 00 41 55 4d 8d af 90 00 00 00 41 54 53 48 83 ec 30 48 8b 07 89 75 d4 <48> 8b 50 20 48 8b ba c0 02 00 00 49 8d 57 70 48 89 55 c8 49 89 
[   19.994115] RIP  [<ffffffffa00a0da0>] try_fill_recv+0x30/0x530 [virtio_net]
[   19.994115]  RSP <ffff88003bb3b718>
[   19.994115] CR2: 0000000036560020
[   19.996181] ---[ end trace 559c3564f3ff8872 ]---

Hope that helps.

Comment 59 Neil Horman 2013-07-24 13:41:00 UTC
sooo, that looks like a completely different bug than what we've been encountering here previously.  Do multiple attempts at reproducing this bug result in the same backtrace?  I'm trying to determine if you've found another virtio_net bug, or if something bigger is going on here.

Comment 60 Eric Blake 2013-07-25 17:21:05 UTC
(In reply to Neil Horman from comment #59)
> sooo, that looks like a completely different bug than what we've been
> encountering here previously.  Do multiple attempts at reproducing this bug
> result in the same backtrace?  I'm trying to determine if you've found
> another virtio_net bug, or if something bigger is going on here.

I re-loaded 3.10.2-301.fc19.x86_64 and tried again; this time I got a 100% cpu usage loop.  So it looks like there may be several bugs in play, and which one hits is arbitrary.

Comment 61 Neil Horman 2013-07-25 17:59:00 UTC
Well, I don't know what to tell you then.  I've got an f19 virt guest here running on an f18 host that never seems to hit this problem, so I'm currently unable to reproduce.  Is there anything special about your host that causes these bugs to trigger?

Comment 62 Eric Blake 2013-07-25 18:04:43 UTC
No idea what makes my host different from yours; I first hit the problem with an F18 host while using qemu from fedora-virt-preview; I've since upgraded my host to F19, but am still using qemu from fedora-virt-preview (currently qemu-kvm-1.5.1-2.fc19.x86_64).  I have no idea how to debug 100% cpu loops on a stuck kernel - is there some keystroke I could send to interrupt the cpu and get a trace of where the loop is executing?

Comment 63 Neil Horman 2013-07-25 18:22:58 UTC
Yes, typically on baremetal I would use the magic sysrq keys (see Documentation/sysrq.txt); specifically, I'd use alt-sysrq-l.  You'll likely have to enable it via sysctl after boot, but when you hit 100% cpu, use whatever method qemu provides to send multiple keystrokes, and that command will dump the backtrace of all active cpus.
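
Concretely, that might look like the following ($guest is a placeholder; virsh send-key is one way to have qemu/libvirt deliver the key combination from the host):

  # inside the guest, after boot:
  sysctl -w kernel.sysrq=1

  # from the host, once the guest is spinning at 100% cpu:
  virsh send-key $guest KEY_LEFTALT KEY_SYSRQ KEY_L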

Note, I'd open a separate bz for the 100% cpu issue, to avoid confusion while debugging this one.

Comment 64 Neil Horman 2013-09-11 18:43:27 UTC
ping, any further info here?  I'm still unable to reproduce this

