Bug 601480

Summary: RHEL4 guest netdump I/O stall
Product: Red Hat Enterprise Linux 6
Component: qemu-kvm
Version: 6.0
Hardware: All
OS: Linux
Status: CLOSED WORKSFORME
Severity: medium
Priority: low
Reporter: Qian Cai <qcai>
Assignee: Amit Shah <amit.shah>
QA Contact: Virtualization Bugs <virt-bugs>
CC: cye, ddumas, mkenneth, mst, tburke, virt-maint
Target Milestone: rc
Doc Type: Bug Fix
Last Closed: 2010-11-12 07:19:05 UTC
Bug Blocks: 524819, 580953
Attachments:
  guest xml
  log file
  server side's log when the client hung during the netdump
  guest xml

Description Qian Cai 2010-06-08 04:26:44 UTC
Created attachment 422017 [details]
guest xml

Description of problem:
It looks like I/O stalled during a netdump of a RHEL4.8 guest running under RHEL6 KVM. Using a virtio NIC did not make any difference either.

[...network console startup...]
SysRq : Crashing the kernel by request
Unable to handle kernel NULL pointer dereference at 0000000000000000 RIP: 
<ffffffff8023f54c>{sysrq_handle_crash+0}
PML4 3a7d1067 PGD 3a107067 
Oops: 0002 [1] PMD 0 SMP 
CPU 1 
Modules linked in: md5 ipv6 parport_pc lp parport netconsole netdump autofs4 sunrpc iptable_filter ip_tables ds yenta_socket pcmcia_core cpufreq_powersave dm_mirror dm_mod button battery ac uhci_hcd snd_intel8x0 snd_ac97_codec snd_pcm_oss snd_mixer_oss snd_timer snd_pcm snd_page_alloc snd_mpu401_uart snd_rawmidi snd_seq_device snd soundcore virtio_net 8139cp mii floppy ext3 jbd virtio_blk virtio_pci virtio virtio_ring sd_mod scsi_mod
Pid: 4127, comm: bash Not tainted 2.6.9-89.ELsmp
RIP: 0010:[<ffffffff8023f54c>] <ffffffff8023f54c>{sysrq_handle_crash+0}
RSP: 0018:000001003942feb0  EFLAGS: 00010012
RAX: 000000000000001f RBX: ffffffff80414220 RCX: ffffffff803f66e8
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000063
RBP: 0000000000000063 R08: ffffffff803f66e8 R09: ffffffff80414220
R10: 0000000100000000 R11: ffffffff8011f688 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000006 R15: 0000000000000246
FS:  0000002a95aac6e0(0000) GS:ffffffff80504580(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 000000003f128000 CR4: 00000000000006e0
Process bash (pid: 4127, threadinfo 000001003942e000, task 000001003edd3030)
Stack: ffffffff8023f70f 0000000000000000 000001003942e000 0000000000000002 
       000001003942ff50 0000000000000002 0000002a98a02000 0000000000000000 
       ffffffff801b46e1 0000000000000048 
Call Trace:<ffffffff8023f70f>{__handle_sysrq+115} <ffffffff801b46e1>{write_sysrq_trigger+43} 
       <ffffffff8017c4ce>{vfs_write+207} <ffffffff8017c5b6>{sys_write+69} 
       <ffffffff801102f6>{system_call+126} 

Code: c6 04 25 00 00 00 00 00 c3 e9 b8 e3 f3 ff e9 49 32 f4 ff 48 
RIP <ffffffff8023f54c>{sysrq_handle_crash+0} RSP <000001003942feb0>
CR2: 0000000000000000
CPU#0 is frozen.
CPU#1 is executing netdump.
< netdump activated - performing handshake with the server. >
NETDUMP START!
< handshake completed - listening for dump requests. >
0(4295448000)/

After a long time, the guest was still dumping, but the vmcore size did not increase:

# ls -l
total 9132
-rw-------  1 netdump netdump     2017 Jun  8 00:14 log
-rw-------  1 netdump netdump 20975616 Jun  8 00:22 vmcore-incomplete

Version-Release number of selected component (if applicable):
qemu-kvm-0.12.1.2-2.69.el6.x86_64
libvirt-0.8.1-7.el6.x86_64
kernel-2.6.32-19.el6.x86_64

How reproducible:
always

Steps to Reproduce:
1. Set up a RHEL4.8 x86_64 guest on my Intel-based x200 laptop.
2. Set up netdump and crash the guest (a sketch of the setup follows below).
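
For reference, a minimal sketch of the netdump setup behind these steps, assuming the RHEL4 guest is the netdump client, the server lives at 192.168.122.1, and the netdump/netdump-server packages are already installed (the address is an assumption, not taken from this report):

On the netdump server, set a password for the netdump user and start the service:
# passwd netdump
# service netdump-server start

On the RHEL4.8 guest (the client), point it at the server, propagate the key, and start the client:
# echo 'NETDUMPADDR=192.168.122.1' >> /etc/sysconfig/netdump
# service netdump propagate
# service netdump start

Then trigger the crash from inside the guest:
# echo 1 > /proc/sys/kernel/sysrq
# echo c > /proc/sysrq-trigger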
  
Actual results:
The guest keeps dumping for a long time, but the vmcore size does not increase further.

Expected results:
Complete the netdump with a full vmcore.

Comment 2 Qian Cai 2010-06-08 04:54:03 UTC
The e1000 and ne2k_pci models have the same problem.

Comment 3 Qian Cai 2010-06-08 05:31:25 UTC
(In reply to comment #0)
> kernel-2.6.32-19.el6.x86_64
Correction - kernel-2.6.32-33.el6.x86_64

Comment 4 Dor Laor 2010-06-09 12:26:25 UTC
Why is it a blocker? Does it work for a RHEL5 guest?

Comment 6 Amit Shah 2010-06-16 14:56:42 UTC
How do you set up netdump, on both the server and the client?

From the guest xml it looks like you're using the default 'user' networking, which doesn't allow outside connections into the guest. Please try with tap.

This could be the reason for the stall: the server may be trying to open a connection to the client in the guest, and since that cannot happen, the dump doesn't proceed.
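
As an illustration of the suggested change (the names below are assumptions, not taken from the attached XML), the guest's NIC definition would move from 'user' networking to a bridge/tap setup, e.g. via virsh edit:

# virsh edit <guest-name>

replacing the interface element

  <interface type='user'>
    <model type='virtio'/>
  </interface>

with something along the lines of

  <interface type='bridge'>
    <source bridge='br0'/>
    <model type='virtio'/>
  </interface>

where br0 is an existing host bridge.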

Comment 7 Michael S. Tsirkin 2010-06-16 18:54:14 UTC
*** Bug 603413 has been marked as a duplicate of this bug. ***

Comment 8 Qian Cai 2010-06-17 04:19:09 UTC
After setting up the netdump server and client on the same subnet in NAT mode, I got a single successful dump with the virtio driver on a RHEL4 guest; the rest of the attempts failed.

I think there are possibly 4 puzzles that need to be solved:

1) RHEL4 guest (rtl8139, virtio, e1000) and RHEL4 (rtl8139) usually rebooted the
   client immediately after triggering a crash with empty vmcore-incomplete.

2) RHEL3 guest (e1000) hung during netdump - will use bug 603413 to track.

3) RHEL3 guest virtio driver did not support netconsole - bug 603420.

4) netdump not working in NAT mode with the server and client not on the same
   subnet, even though starting the netdump service worked fine. This could also
   be a documentation fix if it is not yet documented anywhere (a quick
   reachability check is sketched below).
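
As a quick way to separate a plain networking problem from a netdump problem in the NAT / different-subnet case, basic reachability from the guest to the server can be checked before crashing the guest. The address below is an assumption, and 6666/udp is assumed to be the default netdump port; adjust if configured differently:

# ping -c 3 10.66.0.1
# echo probe | nc -u -w 2 10.66.0.1 6666
# service netdump restart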

Comment 9 Qian Cai 2010-06-17 04:25:49 UTC
Created attachment 424668 [details]
log file

The client rebooted immediately during the netdump.

# ls -l
total 44
-rw-------  1 netdump netdump 35953 Jun 17 03:59 log
-rw-------  1 netdump netdump     0 Jun 17 03:59 vmcore-incomplete

Comment 10 Amit Shah 2010-06-17 06:01:27 UTC
(In reply to comment #8)
> After setup netdump server and client on the same subnet with the NAT mode, I
> got a single successfully dump with virtio driver on RHEL4 guest and the rest
> of the attempts failed.

OK, so it looks like the network topology does play a role. It could be a bug in the qemu-kvm network setup, or, as you say in point 4 below, a netdump limitation.

We need some more information, as requested by people who are more knowledgeable about netdump.

Can you please provide us with the logs from the machine where the netdump server is run? Those logs could provide some more information on where it's stalling and why.

There's also a possibility of the guest locking up -- can you check with RHEL6 guests and the different nic models to see if there are any recent fixes that might need backporting to these older guests (or if we need to fix those bugs for RHEL6 as well)?

> I think there are possible 4 puzzles need to be solved,
> 
> 1) RHEL4 guest (rtl8139, virtio, e1000) and RHEL4 (rtl8139) usually rebooted
> the
>    client immediately after triggering a crash with empty vmcore-incomplete.

Do you mean one run with virtio-net resulted in a successful dump, but more runs with virtio-net didn't succeed?

That would indicate it's a guest lockup bug somewhere.

> 2) RHEL3 guest (e1000) hung during netdump - will use bug 603413 to track.
> 
> 3) RHEL3 guest virtio driver did not support netconsole - bug 603420.
> 
> 4) netdump not working in NAT mode with the server and client not on the same
>    subnet, even though starting the netdump service worked fine. This could also
>    be a documentation fix if it is not yet documented anywhere.

I'll ask people more informed about netdump if this is a necessity.

Comment 11 Qian Cai 2010-06-17 11:06:27 UTC
> Can you please provide us with the logs from the machine where the netdump
> server is run? Those logs could provide some more information on where it's
> stalling and why.
Which logs do you need -- /var/log/dmesg /var/log/messages from the server?

> There's also a possibility of the guest locking up -- can you check with RHEL6
> guests and the different nic models to see if there are any recent fixes that
> might need backporting to these older guests (or if we need to fix those bugs
> for RHEL6 as well)?
Hmm, RHEL6 guest did not support netdump. Not sure what to test there.

> Do you mean one run with virtio-net resulted in a successful dump, but more
> runs with virtio-net didn't succeed?
Correct.

Comment 12 Amit Shah 2010-06-17 11:30:54 UTC
(In reply to comment #11)
> > Can you please provide us with the logs from the machine where the netdump
> > server is run? Those logs could provide some more information on where it's
> > stalling and why.
> Which logs do you need -- /var/log/dmesg /var/log/messages from the server?

/var/log/messages

> > There's also a possibility of the guest locking up -- can you check with RHEL6
> > guests and the different nic models to see if there are any recent fixes that
> > might need backporting to these older guests (or if we need to fix those bugs
> > for RHEL6 as well)?
> Hmm, RHEL6 guest did not support netdump. Not sure what to test there.

Right; it's kdump on RHEL6.

> > Do you mean one run with virtio-net resulted in a successful dump, but more
> > runs with virtio-net didn't succeed?
> Correct.    

OK.

Also, netdump should work properly when the server and the client are on different subnets. However, since we know this worked once when both were on the same subnet, let's try to narrow down and fix this problem first. When this is fixed, you can try with the server and client on different subnets. If it doesn't work, it's a separate bug.

Comment 13 Qian Cai 2010-06-17 23:52:25 UTC
Created attachment 424980 [details]
server side's log when the client hung during the netdump

Hmm, the virtio_net RHEL4 guest hung as well.

Comment 14 Qian Cai 2010-06-18 02:10:48 UTC
> 1) RHEL4 guest (rtl8139, virtio, e1000) and RHEL4 (rtl8139) usually rebooted
> the client immediately after triggering a crash with empty vmcore-incomplete.

This one is actually a false alarm: the client rebooted immediately because the server ran out of disk space. However, as mentioned above, it can also occasionally hang, which I can't reproduce reliably so far, so we can probably use this BZ to track the NAT mode / different-subnets issue.

Comment 15 Michael S. Tsirkin 2010-06-20 16:45:27 UTC
It seems that qemu allows packets of up to 16384 bytes.
Please try with MTU 16384.

Comment 16 Qian Cai 2010-06-22 03:08:13 UTC
Looks like it is still the same problem after changing the setting on the host:
# ifconfig vnet0 mtu 16384
# ifconfig
...
vnet0     Link encap:Ethernet  HWaddr 16:5A:E7:4B:35:77  
          inet6 addr: fe80::145a:e7ff:fe4b:3577/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:16384  Metric:1
          RX packets:21022 errors:0 dropped:0 overruns:0 frame:0
          TX packets:5443 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:500 
          RX bytes:22422154 (21.3 MiB)  TX bytes:360360 (351.9 KiB)
...

Comment 18 Amit Shah 2010-11-10 10:03:56 UTC
Can you test with newer qemu-kvm versions?  A fix for a similar bug was pushed as part of a different bug report.

Comment 19 Chao Ye 2010-11-12 02:40:50 UTC
Tested with latest build:
=================================================================
Host: ibm-hs21-01.rhts.eng.nay.redhat.com/RHEL6/KVM
-------------------------------------------------------
[root@ibm-hs21-01 ~]# rpm -q kernel qemu-kvm libvirt
kernel-2.6.32-81.el6.x86_64
qemu-kvm-0.12.1.2-2.118.el6.x86_64
libvirt-0.8.1-27.el6.x86_64


Guest: bootp-66-87-249.rhts.eng.nay.redhat.com/RHEL4-U9
-------------------------------------------------------
[root@bootp-66-87-249 ~]# rpm -q kernel netdump
kernel-2.6.9-90.EL
netdump-0.7.16-15

Test result:
-------------------------------------------------------
NIC model   Result     Speed
e1000       success    fast
ne2k_pci    success    slow
pcnet       success    fast
rtl8139     success    fast
virtio      success    fast

Comment 20 Chao Ye 2010-11-12 02:41:27 UTC
Created attachment 459898 [details]
guest xml

Comment 21 Amit Shah 2010-11-12 07:19:05 UTC
Thanks for testing.

Since virtio, e1000, and rtl8139 (the supported NIC models) work fine, I'll close this bug report.