Bug 601480
Summary: RHEL4 guest netdump I/O stall

Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: low
Reporter: Qian Cai <qcai>
Assignee: Amit Shah <amit.shah>
QA Contact: Virtualization Bugs <virt-bugs>
CC: cye, ddumas, mkenneth, mst, tburke, virt-maint
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2010-11-12 07:19:05 UTC
Bug Blocks: 524819, 580953
e1000 and ne2k_pci all have the same problem.

(In reply to comment #0)
> kernel-2.6.32-19.el6.x86_64
Correction - kernel-2.6.32-33.el6.x86_64

Why is it a blocker? Does it work for a RHEL5 guest? How do you set up netdump -- both server and client?

From the guest xml it looks like you're using the default 'user' networking, which doesn't allow outside connections into the guest. Please try with tap. This could be the reason for the stall: the server could be opening a connection to the client in the guest, and since that doesn't happen, the dump doesn't proceed.

*** Bug 603413 has been marked as a duplicate of this bug. ***

After setting up the netdump server and client on the same subnet in NAT mode, I got a single successful dump with the virtio driver on a RHEL4 guest; the rest of the attempts failed. I think there are four puzzles that need to be solved:

1) RHEL4 guests (rtl8139, virtio, e1000) and RHEL4 (rtl8139) usually rebooted the client immediately after triggering a crash, leaving an empty vmcore-incomplete.

2) A RHEL3 guest (e1000) hung during netdump - will use bug 603413 to track.

3) The RHEL3 guest virtio driver did not support netconsole - bug 603420.

4) netdump does not work in NAT mode when the server and client are not on the same subnet, even though starting the netdump service worked fine. This could also be a documentation fix if not yet documented anywhere.

Created attachment 424668 [details]
log file
The guest rebooted immediately during the netdump.
# ls -l
total 44
-rw------- 1 netdump netdump 35953 Jun 17 03:59 log
-rw------- 1 netdump netdump 0 Jun 17 03:59 vmcore-incomplete
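For reference, the RHEL3/RHEL4-era netdump setup used in this test looks roughly like the following sketch. The server address 10.0.0.1 and the exact package/service invocations are assumptions; verify them against the netdump and netdump-server documentation for your release.

```shell
# --- On the netdump server host ---
# Install the server and set a password for the 'netdump' user; clients
# use it once to propagate their ssh key (sketch; names may vary by release).
yum install netdump-server
passwd netdump
service netdump-server start
chkconfig netdump-server on

# --- On the RHEL4 guest (netdump client) ---
# 10.0.0.1 is a placeholder for the netdump server's address.
cat >> /etc/sysconfig/netdump <<'EOF'
NETDUMPADDR=10.0.0.1
EOF
service netdump propagate   # copies the client key to the server
service netdump start
chkconfig netdump on

# Completed dumps land under /var/crash/<client-ip>/ on the server,
# which is where the 'log' and 'vmcore-incomplete' files above come from.
```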
(In reply to comment #8)
> After setup netdump server and client on the same subnet with the NAT mode, I
> got a single successfully dump with virtio driver on RHEL4 guest and the rest
> of the attempts failed.

OK, so it looks like the network topology does play a role. It could be a bug in the qemu-kvm network setup, or, as you say in point 4 below, a netdump limitation. We need some more information, as requested by people who are more knowledgeable about netdump.

Can you please provide us with the logs from the machine where the netdump server is run? Those logs could provide some more information on where it's stalling and why.

There's also a possibility of the guest locking up -- can you check with RHEL6 guests and the different nic models to see if there are any recent fixes that might need backporting to these older guests (or if we need to fix those bugs for RHEL6 as well)?

> I think there are possible 4 puzzles need to be solved,
>
> 1) RHEL4 guest (rtl8139, virtio, e1000) and RHEL4 (rtl8139) usually rebooted
> the client immediately after triggering a crash with empty vmcore-incomplete.

Do you mean one run with virtio-net resulted in a successful dump, but more runs with virtio-net didn't succeed? That would indicate it's a guest lockup bug somewhere.

> 2) RHEL3 guest (e1000) hung during netdump - will use bug 603413 to track.
>
> 3) RHEL3 guest virtio driver did not support netconsole - bug 603420.
>
> 4) netdump not working in NAT mode with server/client not in the same subnet
> while netdump service start was working fine. This is also can a document
> fix if not yet documented anywhere.

I'll ask people more informed about netdump whether this is a necessity.

> Can you please provide us with the logs from the machine where the netdump
> server is run? Those logs could provide some more information on where it's
> stalling and why.

Which logs do you need -- /var/log/dmesg or /var/log/messages from the server?
> There's also a possibility of the guest locking up -- can you check with RHEL6
> guests and the different nic models to see if there are any recent fixes that
> might need backporting to these older guests (or if we need to fix those bugs
> for RHEL6 as well)?

Hmm, RHEL6 guests do not support netdump. Not sure what to test there.

> Do you mean one run with virtio-net resulted in a successful dump, but more
> runs with virtio-net didn't succeed?

Correct.

(In reply to comment #11)
> > Can you please provide us with the logs from the machine where the netdump
> > server is run? Those logs could provide some more information on where it's
> > stalling and why.
> Which logs do you need -- /var/log/dmesg /var/log/messages from the server?

/var/log/messages

> > There's also a possibility of the guest locking up -- can you check with RHEL6
> > guests and the different nic models to see if there are any recent fixes that
> > might need backporting to these older guests (or if we need to fix those bugs
> > for RHEL6 as well)?
> Hmm, RHEL6 guest did not support netdump. Not sure what to test there.

Right; it's kdump on RHEL6.

> > Do you mean one run with virtio-net resulted in a successful dump, but more
> > runs with virtio-net didn't succeed?
> Correct.

OK. Also, netdump should work properly when the server and the client are on different subnets. However, since we know this worked once when both were on the same subnet, let's try to narrow down and fix this problem first. When this is fixed, you can try with the server and client on different subnets. If it doesn't work, it's a separate bug.

Created attachment 424980 [details]
server side's log when the client hung during the netdump
Hmm, the virtio_net RHEL4 guest hung as well.
> 1) RHEL4 guest (rtl8139, virtio, e1000) and RHEL4 (rtl8139) usually rebooted
> the client immediately after triggering a crash with empty vmcore-incomplete.
This one is actually a false alarm: the client rebooted immediately because the server ran out of disk space. However, as mentioned above, it could also occasionally hang, which I can't reproduce reliably so far, so we can probably use this BZ to track the NAT mode / different-subnets issue.
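As a concrete illustration of the earlier suggestion to move off the default 'user' networking, the guest XML can be switched to a bridged (tap) interface roughly as below. The bridge name br0 is an assumption; it must refer to a bridge that already exists on the host.

```xml
<!-- Sketch: replaces the <interface type='user'> stanza in the guest XML.
     'br0' is a placeholder for an existing host bridge. -->
<interface type='bridge'>
  <source bridge='br0'/>
  <model type='virtio'/>  <!-- or e1000 / rtl8139 for the other test cases -->
</interface>
```

With a bridged interface the netdump server can open connections into the guest, which user-mode networking does not allow.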
It seems that qemu allows packets up to 16384 bytes. Please try with mtu 16384.

Looks like it is still the same problem after changing the setting on the host:

# ifconfig vnet0 mtu 16384
# ifconfig
...
vnet0  Link encap:Ethernet  HWaddr 16:5A:E7:4B:35:77
       inet6 addr: fe80::145a:e7ff:fe4b:3577/64 Scope:Link
       UP BROADCAST RUNNING MULTICAST  MTU:16384  Metric:1
       RX packets:21022 errors:0 dropped:0 overruns:0 frame:0
       TX packets:5443 errors:0 dropped:0 overruns:0 carrier:0
       collisions:0 txqueuelen:500
       RX bytes:22422154 (21.3 MiB)  TX bytes:360360 (351.9 KiB)
...

Can you test with newer qemu-kvm versions? A fix for a similar bug was pushed as part of a different bug report.

Tested with the latest build:
=================================================================
Host: ibm-hs21-01.rhts.eng.nay.redhat.com/RHEL6/KVM
-------------------------------------------------------
[root@ibm-hs21-01 ~]# rpm -q kernel qemu-kvm libvirt
kernel-2.6.32-81.el6.x86_64
qemu-kvm-0.12.1.2-2.118.el6.x86_64
libvirt-0.8.1-27.el6.x86_64

Guest: bootp-66-87-249.rhts.eng.nay.redhat.com/RHEL4-U9
-------------------------------------------------------
[root@bootp-66-87-249 ~]# rpm -q kernel netdump
kernel-2.6.9-90.EL
netdump-0.7.16-15

Test result:
-------------------------------------------------------
e1000     success  fast
ne2k_pci  success  slow
pcnet     success  fast
rtl8139   success  fast
virtio    success  fast

Created attachment 459898 [details]
guest xml
Thanks for testing. Since virtio, e1000 and rtl8139 (the supported nic cards) work fine, I'll close this bug report.
Created attachment 422017 [details]
guest xml

Description of problem:
Looks like the I/O stalled during the netdump for a RHEL4.8 guest under RHEL6 KVM. A virtio NIC did not make any difference either.

[...network console startup...]

SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at 0000000000000000
RIP: <ffffffff8023f54c>{sysrq_handle_crash+0}
PML4 3a7d1067 PGD 3a107067 PMD 0
Oops: 0002 [1] SMP
CPU 1
Modules linked in: md5 ipv6 parport_pc lp parport netconsole netdump autofs4 sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core cpufreq_powersave dm_mirror dm_mod button battery ac uhci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_timer snd_pcm snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore virtio_net 8139cp mii floppy ext3 jbd virtio_blk virtio_pci virtio virtio_ring sd_mod scsi_mod
Pid: 4127, comm: bash Not tainted 2.6.9-89.ELsmp
RIP: 0010:[<ffffffff8023f54c>] <ffffffff8023f54c>{sysrq_handle_crash+0}
RSP: 0018:000001003942feb0  EFLAGS: 00010012
RAX: 000000000000001f RBX: ffffffff80414220 RCX: ffffffff803f66e8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
RBP: 0000000000000063 R08: ffffffff803f66e8 R09: ffffffff80414220
R10: 0000000100000000 R11: ffffffff8011f688 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000246
FS:  0000002a95aac6e0(0000) GS:ffffffff80504580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003f128000 CR4: 00000000000006e0
Process bash (pid: 4127, threadinfo 000001003942e000, task 000001003edd3030)
Stack: ffffffff8023f70f 0000000000000000 000001003942e000 0000000000000002
       000001003942ff50 0000000000000002 0000002a98a02000 0000000000000000
       ffffffff801b46e1 0000000000000048
Call Trace:
 <ffffffff8023f70f>{__handle_sysrq+115}
 <ffffffff801b46e1>{write_sysrq_trigger+43}
 <ffffffff8017c4ce>{vfs_write+207}
 <ffffffff8017c5b6>{sys_write+69}
 <ffffffff801102f6>{system_call+126}
Code: c6 04 25 00 00 00 00 00 c3 e9 b8 e3 f3 ff e9 49 32 f4 ff 48
RIP <ffffffff8023f54c>{sysrq_handle_crash+0} RSP <000001003942feb0>
CR2: 0000000000000000
CPU#0 is frozen.
CPU#1 is executing netdump.
< netdump activated - performing handshake with the server. >
NETDUMP START!
< handshake completed - listening for dump requests. >
0(4295448000)/

After a long time, it was still dumping, and the size did not increase:

# ls -l
total 9132
-rw------- 1 netdump netdump     2017 Jun  8 00:14 log
-rw------- 1 netdump netdump 20975616 Jun  8 00:22 vmcore-incomplete

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.69.el6.x86_64
libvirt-0.8.1-7.el6.x86_64
kernel-2.6.32-19.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a RHEL4.8 x86_64 guest on my Intel-based x200 laptop.
2. Set up netdump and crash the guest.

Actual results:
The dump kept running for a long time, and the vmcore size did not increase further.

Expected results:
The netdump completes with a full vmcore.
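For completeness, the crash in step 2 is triggered through the magic SysRq interface, which matches the sysrq_handle_crash frame in the oops above. These commands deliberately panic the kernel, so they are shown only as a sketch to be run inside the disposable test guest.

```shell
# Enable the magic SysRq interface, then trigger an immediate crash.
# WARNING: this panics the guest kernel on purpose -- run it only inside
# the RHEL4 test guest, never on the host.
echo 1 > /proc/sys/kernel/sysrq
echo c > /proc/sysrq-trigger    # invokes sysrq_handle_crash()
```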