Bug 725332
Summary: (RHEL-6.1 KVM SMP virtio_net) BUG: scheduling while atomic: swapper/0/0x00000100

Field | Value | Field | Value
---|---|---|---
Product | Red Hat Enterprise Linux 6 | Reporter | Warren Togami <wtogami>
Component | kernel | Assignee | Gleb Natapov <gleb>
Status | CLOSED DUPLICATE | QA Contact | Red Hat Kernel QE team <kernel-qe>
Severity | urgent | Priority | unspecified
Version | 6.1 | Target Milestone | rc
Hardware | x86_64 | OS | Unspecified
Doc Type | Bug Fix | Last Closed | 2011-09-25 20:16:47 UTC
CC | andy.wallis, chorn, chrisw, herbert.xu, hqucocl, kevin, knoel, max.karavaev, mchristi, mishu, riel, tburke | |
Description
Warren Togami
2011-07-25 07:40:12 UTC
http://mirror.ancl.hawaii.edu/

This issue is crippling our new Fedora mirror server. Please help!

Workaround: Dropping the KVM guest to one VCPU seems to prevent this problem.

Warren, could you add some information about the configuration of the system that sees this issue? What kinds of storage are attached? What kind of network interface? How much traffic?

Host Hardware
=============
HP DL360 G6
Intel Xeon E5540 with 6GB RAM (4 cores with hyperthreading)
Netgear ReadyNAS 4200 serving a 12TB array over iSCSI
/dev/sda: Hewlett-Packard Company Smart Array G6 controllers (rev 01)
eth0: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
eth1: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

Mirror VM
=========
(virtio_net) eth0 http://mirror.ancl.hawaii.edu
(virtio_net) eth1 (private interface to NAS)
(virtio_blk) / filesystem on local SCSI disk
(iscsi) /srv/mirror 12TB ext4-formatted iSCSI volume accessed via eth1

Workload
========
The mirror runs several rsync processes from cron to sync data from Fedora, EPEL, CentOS, Scientific Linux, The Document Foundation, Debian, Ubuntu and more. Meanwhile, lighttpd, vsftpd and the rsync daemon serve mirror content to clients. All mirror traffic traverses both eth0 and eth1, because the /srv/mirror content is stored on the ext4-formatted iSCSI NAS array.

(In reply to comment #5)
> eth0: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)
> eth1: Broadcom Corporation NetXtreme II BCM5709 Gigabit Ethernet (rev 20)

What iSCSI driver are you using? The software iSCSI driver, iscsi_tcp, or do your Broadcom cards have iSCSI offload enabled so that you are using bnx2i for offloaded iSCSI? By default we use iscsi_tcp. While you are logged into the iSCSI target, you can run

    iscsiadm -m session -P 3 | grep transport

to see whether bnx2i is being used.

iscsi_tcp.
As noted above, the iSCSI initiator is running inside the VM and communicates over virtio_net with the iSCSI target, so it could not use hardware-offloaded iSCSI.

Note: "BUG: scheduling while atomic: swapper/0/0x10000002" also shows up in bug 726877 for 3.1.0-0.rc0... kernels with no virtio_net and no iSCSI.

(In reply to comment #8)
> Note: "BUG: scheduling while atomic: swapper/0/0x10000002" also shows up in
> bug 726877 for 3.1.0-0.rc0... kernels with no virtio_net and no iSCSI.

This is most likely a different bug entirely.

I switched the RHEL-6.1 VM back to 3 VCPUs, but this time changed the virtual eth0 and eth1 from virtio_net to e1000. For the past hour, several simultaneous streams have been running both to and from the server, traversing eth0 via rsync and httpd and reading and writing /srv/mirror over eth1 to the iSCSI array. So far the kernel bug has not triggered, so this would seem to be a bug specific to virtio_net. Riel identified similar bugs fixed in other network drivers, but not in libvirt_net?

Does anyone have a plan to fix this? Without virtio, the I/O performance is very poor.

I'm finally seeing this here as well, but oddly on a machine that has no iSCSI, only a lot of network traffic. Kernel 2.6.32-131.12.1.el6.x86_64 here:

    kernel BUG: scheduling while atomic: swapper/0/0x00010000
    Modules linked in: tun ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter iptable_raw ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6table_raw ip6_tables ipv6 ext3 jbd virtio_balloon virtio_net i2c_piix4 i2c_core ext4 mbcache jbd2 virtio_blk virtio_pci virtio_ring virtio pata_acpi ata_generic ata_piix dm_mod [last unloaded: scsi_wait_scan]
    Pid: 0, comm: swapper Not tainted 2.6.32-131.12.1.el6.x86_64 #1
    Call Trace:
     [<ffffffff81055cb6>] ? __schedule_bug+0x66/0x70
     [<ffffffff814db2e2>] ? thread_return+0x5d9/0x777
     [<ffffffff81095085>] ? sched_clock_local+0x25/0x90
     [<ffffffff8109e32b>] ? tick_nohz_stop_idle+0x3b/0x50
     [<ffffffff81009ebe>] ? cpu_idle+0xee/0x110
     [<ffffffff814c318a>] ? rest_init+0x7a/0x80
     [<ffffffff81c1df28>] ? start_kernel+0x41d/0x429
     [<ffffffff81c1d33a>] ? x86_64_start_reservations+0x125/0x129
     [<ffffffff81c1d438>] ? x86_64_start_kernel+0xfa/0x109

Switching the machine to e1000 didn't seem to help. The problem recurs every few days. :(

Kevin, with virtio_net but one VCPU, does it still occur?

Not sure. I just set it to 1 VCPU (although it still has e1000 in place). I'll know in a day or two. ;)

I've got exactly the same issue on one VPS (of the 4 on this host). The host server is an Intel SR1690WBR with two Xeon E5606 CPUs and 24 GB RAM. The VPS has 3 vcores, 8 GB RAM and a DRBD disk.

    # uname -a
    Linux v1.xxxx.xx 2.6.32-131.6.1.el6.x86_64 #1 SMP Tue Jul 12 17:14:50 CDT 2011 x86_64 x86_64 x86_64 GNU/Linux

A restart fixes the issue for a day or two, but that is not a good solution. I'll try running it on one VCPU, but I'm not sure that will work given the heavy load.

Hit it again with 1 VCPU and e1000. Moved it to 1 VCPU and virtio now. It seems to hit about once a day here.

The issue recurs with 1 VCPU and virtio as well. We are going to try a 32-bit guest. It seems all the other reports here are with 64-bit guests?

Yes, we use 64-bit guests. As for us, the problem VPS has been working for about 3 days on one VCPU without problems. But: 1) there is another VPS on this host with 3 VCPUs and the same kernel, and it has been working just fine for about 4 months; 2) the problem VPS worked without problems on 2 VCPUs for about 1 month before the issue appeared.

Yes, the issue recurs with 1 VCPU for me too, so it doesn't depend on the number of VCPUs. Please note that in all cases the logs contain the message [last unloaded: speedstep_lib].

We rebuilt our affected guest as 32-bit, and it has been up and working fine for almost 2 days now (with 4 VCPUs and virtio). So this could well be an x86_64-specific bug.

This appears to be a duplicate of bug 683658, which was fixed in kernel-2.6.32-174.el6.

Note that this is a HOST side bug. You need to upgrade your host kernel to avoid this bug showing up in guests.

(In reply to comment #22)
> This appears to be a duplicate of bug 683658, which was fixed in
> kernel-2.6.32-174.el6
>
> Note that this is a HOST side bug. You need to upgrade your host kernel to
> avoid this bug showing up in guests.

Fully agree, I remember this one. Dear reporters, please try the 6.1.z host kernel, kernel-2.6.32-131.11.1.el6. For the time being I'll close this as a duplicate; please reopen if this kernel does not solve it.

*** This bug has been marked as a duplicate of bug 683658 ***

Hi all. We updated the kernel to 2.6.32-131.12.1.el6.x86_64, but the issue was not solved.

(In reply to comment #25)
> Hi all.
> We updated the kernel to 2.6.32-131.12.1.el6.x86_64, but the issue was not
> solved.

Have you updated the _host_ kernel, not the guest one?

(In reply to comment #25)
> We updated the kernel to 2.6.32-131.12.1.el6.x86_64, but the issue was not
> solved.

From what I see, we do not yet have an erratum for this, so the test kernel kernel-2.6.32-131.11.1.el6 has to be installed to fix the problem with 6.1.z. Another option is to use the RHEL 6.2 beta kernel; the patch went in there before 2.6.32-173 was tagged.

As for me, upgrading the host kernel to 2.6.32-131.12.1.el6.x86_64 solved this issue: at least two weeks without crashes. Thanks for the help!

(In reply to comment #27)
> (In reply to comment #25)
> > We updated the kernel to 2.6.32-131.12.1.el6.x86_64, but the issue was not
> > solved.
>
> From what I see, we do not yet have an erratum for this, so the test kernel
> kernel-2.6.32-131.11.1.el6 has to be installed to fix the problem with 6.1.z.
> Another option is to use the RHEL 6.2 beta kernel; the patch went in there
> before 2.6.32-173 was tagged.

Thanks, after updating the host kernel to 2.6.32-131.12.1.el6.x86_64, the issue was solved.