Bug 725332

Summary:	(RHEL-6.1 KVM SMP virtio_net) BUG: scheduling while atomic: swapper/0/0x00000100
Product:	Red Hat Enterprise Linux 6	Reporter:	Warren Togami <wtogami>
Component:	kernel	Assignee:	Gleb Natapov <gleb>
Status:	CLOSED DUPLICATE	QA Contact:	Red Hat Kernel QE team <kernel-qe>
Severity:	urgent	Docs Contact:
Priority:	unspecified
Version:	6.1	CC:	andy.wallis, chorn, chrisw, herbert.xu, hqucocl, kevin, knoel, max.karavaev, mchristi, mishu, riel, tburke
Target Milestone:	rc
Target Release:	---
Hardware:	x86_64
OS:	Unspecified
Whiteboard:
Fixed In Version:		Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2011-09-25 20:16:47 UTC	Type:	---

Description Warren Togami 2011-07-25 07:40:12 UTC

RHEL-6 x86_64 kvm guest mounting ext4 over iscsi.  This server is a Linux mirror server with rsync, lighttpd and vsftpd.

After several hours of uptime this happens, with the second BUG repeating several times per second.  Please let me know if you need more information.

Jul 24 21:05:40 mirror kernel: BUG: scheduling while atomic: swapper/0/0x00000100
Jul 24 21:05:40 mirror kernel: Modules linked in: autofs4 sunrpc sg sd_mod crc_t10dif nf_conntrack_ftp ipt_REJECT xt_helper nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: speedstep_lib]
Jul 24 21:05:40 mirror kernel: Pid: 0, comm: swapper Not tainted 2.6.32-131.6.1.el6.x86_64 #1
Jul 24 21:05:40 mirror kernel: Call Trace:
Jul 24 21:05:40 mirror kernel: [<ffffffff81055cb6>] ? __schedule_bug+0x66/0x70
Jul 24 21:05:40 mirror kernel: [<ffffffff814db1b2>] ? thread_return+0x5d9/0x777
Jul 24 21:05:40 mirror kernel: [<ffffffff81093284>] ? hrtimer_start_range_ns+0x14/0x20
Jul 24 21:05:40 mirror kernel: [<ffffffff81009ebe>] ? cpu_idle+0xee/0x110
Jul 24 21:05:40 mirror kernel: [<ffffffff814c305a>] ? rest_init+0x7a/0x80
Jul 24 21:05:40 mirror kernel: [<ffffffff81bbdf28>] ? start_kernel+0x41d/0x429
Jul 24 21:05:40 mirror kernel: [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
Jul 24 21:05:40 mirror kernel: [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
Jul 24 21:05:40 mirror kernel: NOHZ: local_softirq_pending 20a
Jul 24 21:05:40 mirror kernel: BUG: scheduling while atomic: swapper/0/0x00000100
Jul 24 21:05:40 mirror kernel: Modules linked in: autofs4 sunrpc sg sd_mod crc_t10dif nf_conntrack_ftp ipt_REJECT xt_helper nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables be2iscsi iscsi_boot_sysfs bnx2i cnic uio cxgb4i cxgb4 cxgb3i libcxgbi cxgb3 mdio ib_iser rdma_cm ib_cm iw_cm ib_sa ib_mad ib_core ib_addr ipv6 iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log microcode virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio ata_generic pata_acpi ata_piix dm_mod [last unloaded: speedstep_lib]
Jul 24 21:05:40 mirror kernel: Pid: 0, comm: swapper Not tainted 2.6.32-131.6.1.el6.x86_64 #1
Jul 24 21:05:40 mirror kernel: Call Trace:
Jul 24 21:05:40 mirror kernel: [<ffffffff81055cb6>] ? __schedule_bug+0x66/0x70
Jul 24 21:05:40 mirror kernel: [<ffffffff814db1b2>] ? thread_return+0x5d9/0x777
Jul 24 21:05:40 mirror kernel: [<ffffffff81095005>] ? sched_clock_local+0x25/0x90
Jul 24 21:05:40 mirror kernel: [<ffffffff8109e2ab>] ? tick_nohz_stop_idle+0x3b/0x50
Jul 24 21:05:40 mirror kernel: [<ffffffff81009ebe>] ? cpu_idle+0xee/0x110
Jul 24 21:05:40 mirror kernel: [<ffffffff814c305a>] ? rest_init+0x7a/0x80
Jul 24 21:05:40 mirror kernel: [<ffffffff81bbdf28>] ? start_kernel+0x41d/0x429
Jul 24 21:05:40 mirror kernel: [<ffffffff81bbd33a>] ? x86_64_start_reservations+0x125/0x129
Jul 24 21:05:40 mirror kernel: [<ffffffff81bbd438>] ? x86_64_start_kernel+0xfa/0x109
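
For anyone decoding the message above: as far as I can tell, the 2.6.32 scheduler prints it as comm/pid/preempt_count, so swapper/0/0x00000100 is the idle task (pid 0) with a preempt count of 0x100. The sketch below splits that value into its fields. The bit layout (8 preempt bits, 8 softirq bits, 10 hardirq bits) is an assumption based on 2.6.32-era include/linux/hardirq.h, and decode_preempt_count is a hypothetical helper name, not an existing tool.

```shell
#!/bin/sh
# Split the trailing hex field of a "scheduling while atomic" message
# into preempt / softirq / hardirq nesting counts.
# ASSUMPTION: 2.6.32-era layout, bits 0-7 preempt, 8-15 softirq, 16-25 hardirq.
decode_preempt_count() {
    val=$(( $1 ))   # accepts a hex constant such as 0x00000100
    printf 'preempt=%d softirq=%d hardirq=%d\n' \
        $(( val & 0xff )) \
        $(( (val >> 8) & 0xff )) \
        $(( (val >> 16) & 0x3ff ))
}

decode_preempt_count 0x00000100
```

Under that assumption, 0x00000100 reads as a softirq count of 1 with no preempt-disable or hardirq nesting.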

Comment 2 Warren Togami 2011-07-25 07:45:59 UTC

http://mirror.ancl.hawaii.edu/
This issue is crippling our new Fedora mirror server.  Please help!

Comment 3 Warren Togami 2011-07-25 20:00:31 UTC

Workaround:
Dropping the kvm guest to one VCPU seems to prevent this problem.

Comment 4 Rik van Riel 2011-07-27 17:20:59 UTC

Warren, could you add some info on the configuration of the system that sees this issue?  What kinds of storage are attached?  What kind of network interface?  How much traffic?

Comment 5 Warren Togami 2011-07-28 20:55:57 UTC

Host Hardware
=============
HP DL360 G6
Intel Xeon E5540 with 6GB RAM (4 core with hyperthreading)
Netgear ReadyNAS 4200 serving 12TB array over iSCSI
/dev/sda: Hewlett-Packard Company Smart Array G6 controllers (rev 01)
eth0: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
eth1: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

Mirror VM
=========
(virtio_net) eth0 http://mirror.ancl.hawaii.edu
(virtio_net) eth1 (private interface to NAS)
(virtio_blk) / filesystem on local SCSI disk 
(iscsi)      /srv/mirror 12TB ext4 formatted iSCSI accessed via eth1

Workload
========
The mirror runs several rsync processes from cron to sync data from Fedora, EPEL, CentOS, Scientific Linux, The Document Foundation, Debian, Ubuntu and more.  Meanwhile, lighttpd, vsftpd and rsync daemon serve mirror content to clients.  All mirror traffic traverses both eth0 and eth1 as /srv/mirror content is stored on the ext4 formatted iSCSI NAS array.

Comment 6 Mike Christie 2011-07-28 22:10:44 UTC

(In reply to comment #5)
> eth0: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
> eth1: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

What iscsi driver are you using? The software iscsi driver, iscsi_tcp, or do your broadcom cards have iscsi offload enabled and are you using bnx2i for offloaded iscsi?

By default we will use iscsi_tcp. While you are logged into the iscsi target, you can run

iscsiadm -m session -P 3 | grep transport

to see if bnx2i is being used.

Comment 7 Warren Togami 2011-07-29 00:37:43 UTC

iscsi_tcp.  As noted above, the iscsi initiator is running inside the VM, communicating over virtio_net to the iscsi target, so it wouldn't be able to use hardware offloaded iscsi.

Comment 8 Michal Jaegermann 2011-07-30 21:59:29 UTC

Note: "BUG: scheduling while atomic: swapper/0/0x10000002" shows also in bug 726877 for 3.1.0-0.rc0... kernels with no virtio_net and no iscsi.

Comment 9 Warren Togami 2011-08-01 02:33:53 UTC

(In reply to comment #8)
> Note: "BUG: scheduling while atomic: swapper/0/0x10000002" shows also in bug
> 726877 for 3.1.0-0.rc0... kernels with no virtio_net and no iscsi.

This is most likely a different bug entirely.

I switched the RHEL-6.1 VM back to 3 VCPUs, but this time changed virtual eth0 and eth1 from virtio_net to e1000.  For the past hour, several simultaneous streams have been running both to and from the server, traversing eth0 via rsync and httpd while reading and writing /srv/mirror over eth1 to the iSCSI array.  So far the kernel bug has not triggered.

So it would seem this is a bug specific to virtio_net.  Riel identified similar bugs fixed in other network drivers, but not in virtio_net?

Comment 10 cocl 2011-09-01 02:09:38 UTC

Does anyone plan to fix this? Without virtio, the I/O performance is very poor.

Comment 11 Kevin Fenzi 2011-09-13 23:21:16 UTC

Finally seeing this here as well, but oddly on a machine that has no iscsi, only a lot of network traffic. 

kernel 2.6.32-131.12.1.el6.x86_64 here. 

kernel BUG: scheduling while atomic: swapper/0/0x00010000
Modules linked in: tun ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6table_raw ip6_tables ipv6 ext3 jbd virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
Pid: 0, comm: swapper Not tainted 2.6.32-131.12.1.el6.x86_64 #1
Call Trace:
 [<ffffffff81055cb6>] ? __schedule_bug+0x66/0x70
 [<ffffffff814db2e2>] ? thread_return+0x5d9/0x777
 [<ffffffff81095085>] ? sched_clock_local+0x25/0x90
 [<ffffffff8109e32b>] ? tick_nohz_stop_idle+0x3b/0x50
 [<ffffffff81009ebe>] ? cpu_idle+0xee/0x110
 [<ffffffff814c318a>] ? rest_init+0x7a/0x80
 [<ffffffff81c1df28>] ? start_kernel+0x41d/0x429
 [<ffffffff81c1d33a>] ? x86_64_start_reservations+0x125/0x129
 [<ffffffff81c1d438>] ? x86_64_start_kernel+0xfa/0x109

Comment 12 Kevin Fenzi 2011-09-16 17:58:01 UTC

Switching the machine to e1000 didn't seem to help. The problem re-occurs every few days. :(

Comment 13 Warren Togami 2011-09-17 10:42:54 UTC

Kevin, with virtio_net but one VCPU does it still occur?

Comment 14 Kevin Fenzi 2011-09-17 17:17:32 UTC

Not sure. I just set it to 1 vcpu (although it still has e1000 in place).
Will know in a day or two. ;)

Comment 15 Maxim V Karavaev 2011-09-17 19:26:05 UTC

I've got exactly the same issue on one vps (of the 4 on this host).
Host server is Intel SR1690WBR with two Xeon E5606 and 24Gb RAM.
VPS has 3 vcores, 8Gb ram and drbd disk.
#uname -a
Linux v1.xxxx.xx 2.6.32-131.6.1.el6.x86_64 #1 SMP Tue Jul 12 17:14:50 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux

Restart fixes this issue for a day or two, but it's not a good solution.

Comment 16 Maxim V Karavaev 2011-09-17 19:51:23 UTC

I'll try to run it on one vcpu, but not sure due to heavy load.

Comment 17 Kevin Fenzi 2011-09-19 00:15:49 UTC

Hit it again with 1 vcpu and e1000. Moved it to 1 vcpu and virtio now. 
Seems to hit about once a day here.

Comment 18 Kevin Fenzi 2011-09-20 15:00:43 UTC

Issue re-occurs with 1vcpu and virtio as well. 

We are going to try a 32-bit guest. It seems all the other reports here are with 64-bit guests?

Comment 19 Maxim V Karavaev 2011-09-20 16:10:59 UTC

Yes, we use 64 bit guests.
As for us, the problem vps has been running for about 3 days on one vcpu without problems.

But:
1) there is another vps on this host with 3 vcpus and the same kernel. It has been working just fine for about 4 months,
2) the problem vps worked without problems on 2 vcpus for about 1 month before the issue appeared.

Comment 20 Maxim V Karavaev 2011-09-22 11:33:17 UTC

Yes, the issue re-occurs with 1 vcpu for me too.
So, it doesn't depend on the number of vcpus.
Please notice that in all cases there is a message in the logs: [last unloaded: speedstep_lib].

Comment 21 Kevin Fenzi 2011-09-22 14:53:54 UTC

We rebuilt our affected guest as 32bit and it's been up and working fine for almost 2 days now. 
(With 4 vcpus and virtio). 

So, this could well be an x86_64-specific bug.

Comment 22 Rik van Riel 2011-09-25 16:29:16 UTC

This appears to be a duplicate of bug 683658, which was fixed in kernel-2.6.32-174.el6

Note that this is a HOST side bug. You need to upgrade your host kernel to avoid this bug showing up in guests.

Comment 23 Dor Laor 2011-09-25 20:16:47 UTC

(In reply to comment #22)
> This appears to be a duplicate of bug 683658, which was fixed in
> kernel-2.6.32-174.el6
> 
> Note that this is a HOST side bug. You need to upgrade your host kernel to
> avoid this bug showing up in guests.

Fully agree, I remember this one.
Dear reporters, please try the 6.1.z host kernel - kernel-2.6.32-131.11.1.el6
For the time being I'll close it as a duplicate, please reopen if this kernel does not solve it.

*** This bug has been marked as a duplicate of bug 683658 ***
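
A quick way to check a host against the kernels named in this thread is a version-sort comparison of uname -r. This is only a sketch: kernel_at_least is a hypothetical helper, it relies on GNU sort -V, and plain version ordering says nothing about whether a particular build actually carries the backported fix, so treat the result as a hint rather than proof.

```shell
#!/bin/sh
# Return 0 if the kernel release in $2 (default: the running kernel)
# sorts at or above the minimum release in $1.
# ASSUMPTION: GNU coreutils sort with -V (version sort) is available.
kernel_at_least() {
    min=$1
    cur=${2:-$(uname -r)}
    lowest=$(printf '%s\n%s\n' "$min" "$cur" | sort -V | head -n 1)
    [ "$lowest" = "$min" ]
}

# Run this on the HOST, not inside the guest.
if kernel_at_least 2.6.32-131.11.1.el6; then
    echo "host kernel $(uname -r) is at or above 2.6.32-131.11.1.el6"
else
    echo "host kernel $(uname -r) is older than 2.6.32-131.11.1.el6"
fi
```

The second argument makes it possible to test a release string copied from another machine, for example kernel_at_least 2.6.32-131.11.1.el6 2.6.32-131.6.1.el6.x86_64.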

Comment 25 cocl 2011-10-09 02:57:59 UTC

Hi all.
We updated the kernel to 2.6.32-131.12.1.el6.x86_64, but the issue was not solved.

Comment 26 Gleb Natapov 2011-10-09 07:51:57 UTC

(In reply to comment #25)
> Hi all.
> We updated the kernel to 2.6.32-131.12.1.el6.x86_64,but the issue was not
> solved.

Have you updated the _host_ kernel, not the guest one?
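
One way to confirm which side a shell is on before checking the kernel: on x86, a KVM guest normally exposes a "hypervisor" flag in /proc/cpuinfo, while the host does not. A minimal sketch, assuming that flag is present in the guest (running_in_guest is a hypothetical helper name):

```shell
#!/bin/sh
# Return 0 if the cpuinfo file in $1 (default: /proc/cpuinfo) lists the
# "hypervisor" CPU flag, which suggests this shell is inside a guest.
# ASSUMPTION: the guest kernel reports the flag; absence is not proof of a host.
running_in_guest() {
    grep -qw hypervisor "${1:-/proc/cpuinfo}" 2>/dev/null
}

if running_in_guest; then
    echo "hypervisor flag found: this looks like a guest, kernel $(uname -r)"
else
    echo "no hypervisor flag: this looks like a host, kernel $(uname -r)"
fi
```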

Comment 27 Christian Horn 2011-10-17 09:39:27 UTC

(In reply to comment #25)
> We updated the kernel to 2.6.32-131.12.1.el6.x86_64,but the issue was not
> solved.

From what I see we do not yet have an erratum for this.
So the test kernel kernel-2.6.32-131.11.1.el6 has to be installed to fix the problem with 6.1.z.
Another option is to use the rhel6.2 beta kernel; the patch went in there before 2.6.32-173 was tagged.

Comment 28 Maxim V Karavaev 2011-10-20 22:42:22 UTC

As for me, upgrading host kernel to 2.6.32-131.12.1.el6.x86_64 solved this issue. At least two weeks without crashes.
Thanks for help!

Comment 29 cocl 2011-10-21 05:26:20 UTC

(In reply to comment #27)
> (In reply to comment #25)
> > We updated the kernel to 2.6.32-131.12.1.el6.x86_64,but the issue was not
> > solved.
> 
> From what I see we do not yet have an errata for this.
> So the testkernel kernel-2.6.32-131.11.1.el6 has to be installed to fix the
> problem with 6.1.z .
> Another option is to use the rhel6.2 beta kernel, the patch went in there
> before 2.6.32-173 was tagged.

Thanks, after updating the host kernel to 2.6.32-131.12.1.el6.x86_64, the issue is solved.