Bug 737891

Summary: booting a VM sometimes fails
Product: Red Hat Enterprise Linux 6 Reporter: David Juran <djuran>
Component: gpxeAssignee: jason wang <jasowang>
Status: CLOSED WORKSFORME QA Contact: Virtualization Bugs <virt-bugs>
Severity: high Docs Contact:
Priority: high    
Version: 6.1CC: akong, cpelland, hpaasova, iheim, juzhang, leiwang, orenault, rhod, shuang, tburke
Target Milestone: rc   
Target Release: ---   
Hardware: Unspecified   
OS: Unspecified   
Whiteboard:
Fixed In Version: Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2012-07-23 07:10:07 UTC Type: ---
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
dump of when the boot suceeds
none
dump of when boot fails
none
Dump from the NIC of the VM ( as seen from the hypervisor )
none
Dump from the bridge of the hypervisor
none
Dump from the tftp server
none
dhcpd.conf file
none
Dump captured form the bridge with gPXE testing ROM
none
Screenshot from trying to boot with debug rom
none
VirtIO PXE log
none
VirtIO PXE tcpdump from the bridge
none
DHCP server log
none
tcpdump from br1
none
VM console screencapture
none
br1 tcpdump (20111024)
none
vnet0 tcpdump (20111024)
none
virtio-rom log output (20111024)
none
ifconfig vnet0 (20111024) none

Description David Juran 2011-09-13 10:55:59 UTC
Description of problem:
I've noticed that PXE booting a VM on a RHEL6 host sometimes (most of the times) fails. When it fails, the boot process stops in the PXE screen with the message:

 DHCP (net0 00:1a:4a_09:d9:01)............................. Connection timed out (0x4c106035)


Attached are two (to me) identical dumps, both captured by sniffing the bridge of the host. The difference is that in one case the PXE-boot succeeded and in one case it failed. 


Version-Release number of selected component (if applicable):
gpxe-roms-qemu-0.9.7-6.7.el6

How reproducible:
Most times


Additional info:
I've tried with both an virtio intervace and an RTL8139 and it makes no difference.
I've also verified that the packets actually do reach the vnet0 interface.

Comment 1 David Juran 2011-09-13 10:58:32 UTC
Created attachment 522888 [details]
dump of when the boot suceeds

Comment 2 David Juran 2011-09-13 10:59:13 UTC
Created attachment 522889 [details]
dump of when boot fails

Comment 4 RHEL Program Management 2011-09-13 11:28:17 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 5 RHEL Program Management 2011-09-13 11:48:08 UTC
This request was evaluated by Red Hat Product Management for
inclusion in the current release of Red Hat Enterprise Linux.
Because the affected component is not scheduled to be updated
in the current release, Red Hat is unfortunately unable to
address this request at this time. Red Hat invites you to
ask your support representative to propose this request, if
appropriate and relevant, in the next release of Red Hat
Enterprise Linux. If you would like it considered as an
exception in the current release, please ask your support
representative.

Comment 6 Olivier Renault 2011-09-14 13:32:55 UTC
I am having a similar issue to David ( Connection timeout when trying to PXE ).

a difference w/ David issue, is that if I move my VM NIC from Virtio to e1000 or rtl everything works fine.

Pls find some tcpdump from different part
s

Comment 7 Olivier Renault 2011-09-14 13:34:14 UTC
Created attachment 523149 [details]
Dump from the NIC of the VM ( as seen from the hypervisor )

Comment 8 Olivier Renault 2011-09-14 13:34:49 UTC
Created attachment 523150 [details]
Dump from the bridge of the hypervisor

Comment 9 Olivier Renault 2011-09-14 13:35:24 UTC
Created attachment 523151 [details]
Dump from the tftp server

Comment 10 jason wang 2011-09-19 03:16:07 UTC
Hello Olivier and David:

Not sure whether your issue is the same as https://bugzilla.redhat.com/show_bug.cgi?id=737448. If possible, would you please try the pxe room in fileshare.englab.nay.redhat.com/pub/section2/kvm/akong/virtio-net.rom to check whether it solves your problem?

Thanks

Comment 11 Olivier Renault 2011-09-19 10:24:16 UTC
i ve tried w/ the new virtio-net.rom and I still have the same issue ( tftp timeout )

Comment 12 jason wang 2011-09-20 05:42:08 UTC
(In reply to comment #11)
> i ve tried w/ the new virtio-net.rom and I still have the same issue ( tftp
> timeout )

Thanks for the test, and from the dump file you provides, it seems tftp server recevice ack of block 0 from client but didn't respond for some reasons. So:

1 Please try boot into a guest an use tftp cmd to check whether you can get the pxelinux.0 from the server.
2 Have a look at syslog of tftp server when pxe boot fails to see whether tftpd complains.
3 The tftp server configuration.

Thanks

Comment 13 David Juran 2011-09-20 07:19:03 UTC
Hello.

Looking at Olivier's dumps, I believe his issue is different from the one I opened this Bz about. In Olivier's dump, the boot proceeds past DHCP while in the case this Bz is handling, the boot process never gets past DHCP (see https://bugzilla.redhat.com/attachment.cgi?id=522889)

Also, I just tried fileshare.englab.nay.redhat.com/pub/section2/kvm/akong/virtio-net.rom mentioned in comment #10 and PXE now works. Of course this doesn't prove anything since the problem wasn't 100% reproducible in the first place, but at least I managed to boot the VM three consecutive times.

Anyhow, I suggest we leave wait a few days and if the problem of booting won't re-occur, I guess we can close this Bz as a dup of Bz 737448.

Comment 14 Olivier Renault 2011-09-20 11:31:51 UTC
Hi Jason,

Please find the answer to your question below. ( let me know, if you would like me to open a new BZ as it seems both issues are unrelated )

1 Please try boot into a guest an use tftp cmd to check whether you can get the
pxelinux.0 from the server.
Ok, it works without pb

2 Have a look at syslog of tftp server when pxe boot fails to see whether tftpd
complains.
It looks normal to me, pls find the extract below. 

Sep 20 13:29:30 satellite dhcpd: DHCPDISCOVER from 00:1a:4a:03:76:00 via eth0
Sep 20 13:29:30 satellite dhcpd: DHCPOFFER on 10.3.118.18 to 00:1a:4a:03:76:00 via eth0
Sep 20 13:29:31 satellite dhcpd: uid lease 10.3.118.231 for client 00:1a:4a:03:76:00 is duplicate on 10.3.118/24
Sep 20 13:29:31 satellite dhcpd: DHCPDISCOVER from 00:1a:4a:03:76:00 via eth0
Sep 20 13:29:31 satellite dhcpd: DHCPOFFER on 10.3.118.18 to 00:1a:4a:03:76:00 via eth0
Sep 20 13:29:33 satellite dhcpd: uid lease 10.3.118.231 for client 00:1a:4a:03:76:00 is duplicate on 10.3.118/24
Sep 20 13:29:33 satellite dhcpd: DHCPREQUEST for 10.3.118.18 (10.3.118.65) from 00:1a:4a:03:76:00 via eth0
Sep 20 13:29:33 satellite dhcpd: DHCPACK on 10.3.118.18 to 00:1a:4a:03:76:00 via eth0
Sep 20 13:29:34 satellite xinetd[12489]: START: tftp pid=6889 from=10.3.118.18

3 The tftp server configuration.
Pls, find my dhcpd.conf attached. FYI, I am using cobbler on a satellite server. Also pls note that tftp works perfectly for Physical server or VM using rtl or e1000 NIC.

Comment 15 Olivier Renault 2011-09-20 11:34:36 UTC
Created attachment 524011 [details]
dhcpd.conf file

Pls find my dhcpd.conf file.

Regards,
Olivier

Comment 16 David Juran 2011-09-21 10:48:52 UTC
Olivier: The dumps I'm encountering look quite different from yours, so I'm guessing you issue is different. If you agree, please do open a seperate bz for your issue.

Regarding the original issue, with the rom from http://fileshare.englab.nay.redhat.com/pub/section2/kvm/akong/virtio-net.rom today the booting fails. The dumps (attached, captured on the bridge) show that the conversation goes a bit further but not as far as to TFTP.

Comment 17 David Juran 2011-09-21 10:50:15 UTC
Created attachment 524184 [details]
Dump captured form the bridge with gPXE testing ROM

Comment 18 jason wang 2011-09-22 06:46:02 UTC
(In reply to comment #14)
> Hi Jason,
> 
> Please find the answer to your question below. ( let me know, if you would like
> me to open a new BZ as it seems both issues are unrelated )
> 
> 1 Please try boot into a guest an use tftp cmd to check whether you can get the
> pxelinux.0 from the server.
> Ok, it works without pb
> 
> 2 Have a look at syslog of tftp server when pxe boot fails to see whether tftpd
> complains.
> It looks normal to me, pls find the extract below. 
> 
> Sep 20 13:29:30 satellite dhcpd: DHCPDISCOVER from 00:1a:4a:03:76:00 via eth0
> Sep 20 13:29:30 satellite dhcpd: DHCPOFFER on 10.3.118.18 to 00:1a:4a:03:76:00
> via eth0
> Sep 20 13:29:31 satellite dhcpd: uid lease 10.3.118.231 for client
> 00:1a:4a:03:76:00 is duplicate on 10.3.118/24
> Sep 20 13:29:31 satellite dhcpd: DHCPDISCOVER from 00:1a:4a:03:76:00 via eth0
> Sep 20 13:29:31 satellite dhcpd: DHCPOFFER on 10.3.118.18 to 00:1a:4a:03:76:00
> via eth0
> Sep 20 13:29:33 satellite dhcpd: uid lease 10.3.118.231 for client
> 00:1a:4a:03:76:00 is duplicate on 10.3.118/24
> Sep 20 13:29:33 satellite dhcpd: DHCPREQUEST for 10.3.118.18 (10.3.118.65) from
> 00:1a:4a:03:76:00 via eth0
> Sep 20 13:29:33 satellite dhcpd: DHCPACK on 10.3.118.18 to 00:1a:4a:03:76:00
> via eth0
> Sep 20 13:29:34 satellite xinetd[12489]: START: tftp pid=6889 from=10.3.118.18
> 
> 3 The tftp server configuration.
> Pls, find my dhcpd.conf attached. FYI, I am using cobbler on a satellite
> server. Also pls note that tftp works perfectly for Physical server or VM using
> rtl or e1000 NIC.

Thanks Olivier, as David said, this looks like a different bug, please open a new one.

Comment 19 jason wang 2011-09-22 06:58:27 UTC
(In reply to comment #16)
> Olivier: The dumps I'm encountering look quite different from yours, so I'm
> guessing you issue is different. If you agree, please do open a seperate bz for
> your issue.
> 
> Regarding the original issue, with the rom from
> http://fileshare.englab.nay.redhat.com/pub/section2/kvm/akong/virtio-net.rom
> today the booting fails. The dumps (attached, captured on the bridge) show that
> the conversation goes a bit further but not as far as to TFTP.

Hello David:

Form the dump file you provides, I'm not sure whether this does matter, but see something interesting:

1. looks like there are two different dhcp servers (10.9.163.2 and 10.9.163.3), and they try to assign the same ip address(10.9.141.246) to the client. Is this the configuration you want to use?
2. And the server indentifier used in server replies are 10.8.136.105 which is different from server ip (10.9.163.2 and 10.9.163.3), does your tftp/dhcp server have multiple network interfaces?

Thanks

Comment 20 David Juran 2011-09-22 07:40:48 UTC
Hello Jason

So the DHCP (and TFTP server) really is 10.8.136.105 (it's a RHEL5 machine). It is however on a different subnet from the VM hence the Juniper router (acting as gateway on 10.9.163.1) is used as a DHCP forwarder.
 Now this Juniper router really is a cluster of two routers (.2 and .3) which use the gateway IP (.1) as a floating IP (VIP). It seems however that the implementation of the DHCP forwarder in Juniper makes both nodes of the Juniper cluster to forward the DHCP individually and that's why we see the two replies.

Apart from that it looks weird, I don't _think_ it should pose a problem that the reply comes twice. Also, this configuration works fine on physical hardware (HP Proliant)

Comment 21 jason wang 2011-09-22 11:44:58 UTC
(In reply to comment #20)
> Hello Jason
> 
> So the DHCP (and TFTP server) really is 10.8.136.105 (it's a RHEL5 machine). It
> is however on a different subnet from the VM hence the Juniper router (acting
> as gateway on 10.9.163.1) is used as a DHCP forwarder.
>  Now this Juniper router really is a cluster of two routers (.2 and .3) which
> use the gateway IP (.1) as a floating IP (VIP). It seems however that the
> implementation of the DHCP forwarder in Juniper makes both nodes of the Juniper
> cluster to forward the DHCP individually and that's why we see the two replies.
> 
> Apart from that it looks weird, I don't _think_ it should pose a problem that
> the reply comes twice. Also, this configuration works fine on physical hardware
> (HP Proliant)

Hello David:

Thanks for the information, and if possible, would you please try the rom with debugging output in http://fileshare.englab.nay.redhat.com/pub/section2/kvm/jasonwang/virtio-net.rom to see what happens when dhcp session fail?

Thanks.

Comment 22 David Juran 2011-09-23 08:16:08 UTC
Hello Jason.

Screenshot attached

Comment 23 David Juran 2011-09-23 08:17:51 UTC
Created attachment 524550 [details]
Screenshot from trying to boot with debug rom

Comment 24 Suqin Huang 2011-10-09 04:55:19 UTC
Hi David,
Can you test this bug when it's fixed, I can not reproduce it.

Thanks
Suqin

Comment 25 jason wang 2011-10-10 03:02:26 UTC
(In reply to comment #23)
> Created attachment 524550 [details]
> Screenshot from trying to boot with debug rom

Hello David,

Is the environment met this issue can be used by me for debugging? Thanks

Comment 28 jason wang 2011-10-13 05:51:19 UTC
Hi, in order to narrow doen the problem, please try the following two roms to see whether it helps:


1 upstream ipxe rom with debugging enabled: https://bugzilla.redhat.com/attachment.cgi?id=527668, please collect the debug infomation from guest serial

2 our gpxe with upstream virtio-net driver: https://bugzilla.redhat.com/attachment.cgi?id=527821

Thanks

Comment 29 jason wang 2011-10-18 10:21:21 UTC
Hello David, Henri:

Since we are close to snapshot, is there any update on this?

Thanks

Comment 30 Henri Paasovaara 2011-10-21 09:07:51 UTC
Created attachment 529458 [details]
VirtIO PXE log

Log from serial port when using virtio-net ROM attached previously

Comment 31 Henri Paasovaara 2011-10-21 09:10:20 UTC
Created attachment 529459 [details]
VirtIO PXE tcpdump from the bridge

Tcpdump from bridge when using virtio-net ROM attached previously

Comment 32 Henri Paasovaara 2011-10-21 09:14:02 UTC
These two attached files were captured from VM serial port and bridge when PXE booting over network.

Comment 33 jason wang 2011-10-21 09:23:30 UTC
(In reply to comment #32)
> These two attached files were captured from VM serial port and bridge when PXE
> booting over network.

Hi Henri:

Thanks for the testing. Unfortunately, the debug rom does not provide much more valuable information as we even did not receive an ACK of DHCP discovery which is even worse than the what David see. 

In order to see what happens, please check and provides:

1 The host network configuration (interfaces, bonding or vlans ...) and make sure the forward delay of bridge is zero checking (/sys/class/net/${the bridge used by guest}/bridge/forward_delay)
2 qemu command line
3 dhcp server log ( usually /var/log/messages )
4 Please also try to dump in dhcp server to see what happens on server
5 Try to test without vhost by doing modprobe -r vhost_net before starting virt-manager

Thanks

Comment 34 Henri Paasovaara 2011-10-21 11:10:55 UTC
Jason

More updated log files being attached at next posts.

1. /sys/class/net/br1/bridge/forward_delay is/was zero

2. qemu command line:
/usr/libexec/qemu-kvm -S -M rhel6.0.0 -cpu core2duo,+lahf_lm,+dca,+xtpr,+cx16,+tm2,+est,+vmx,+ds_cpl,+pbe,+tm,+ht,+ss,+acpi,+ds -enable-kvm -m 4096 -smp 4,sockets=4,cores=1,threads=1 -name wano -uuid 166a994e-0b88-01af-7403-94bb9b7db4d6 -nodefconfig -nodefaults -chardev socket,id=charmonitor,path=/var/lib/libvirt/qemu/wano.monitor,server,nowait -mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -drive file=/var/lib/libvirt/images/wano.img,if=none,id=drive-virtio-disk0,format=raw,cache=none -device virtio-blk-pci,bus=pci.0,addr=0x6,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 -netdev tap,fd=22,id=hostnet0 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:7b:a2:30,bus=pci.0,addr=0x3,bootindex=1 -chardev file,id=charserial0,path=/tmp/VMpxe-boot.log -device isa-serial,chardev=charserial0,id=serial0 -usb -vnc 127.0.0.1:0 -vga cirrus -device AC97,id=sound0,bus=pci.0,addr=0x4 -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5

3. DHCP log file attached, 16 rows.

4. Capture was not successful from DHCP server as too much traffic coming and MAC filtering rules didn't work (router does dhcp-forwarding).

5. Tested and same result. Virt-manager didn't load vhost_net automatically.

I hope this updated data will help.

Comment 35 Henri Paasovaara 2011-10-21 11:12:25 UTC
Created attachment 529476 [details]
DHCP server log

Comment 36 Henri Paasovaara 2011-10-21 11:13:37 UTC
Created attachment 529477 [details]
tcpdump from br1

Comment 37 Henri Paasovaara 2011-10-21 11:14:25 UTC
Created attachment 529478 [details]
VM console screencapture

Comment 42 Henri Paasovaara 2011-10-24 10:20:01 UTC
Created attachment 529771 [details]
br1 tcpdump (20111024)

tcpdump from vnet0 and br1 made same time with ipxe console log output

Comment 43 Henri Paasovaara 2011-10-24 10:20:38 UTC
Created attachment 529773 [details]
vnet0 tcpdump (20111024)

Comment 44 Henri Paasovaara 2011-10-24 10:21:32 UTC
Created attachment 529774 [details]
virtio-rom log output (20111024)

Comment 45 Henri Paasovaara 2011-10-24 10:37:38 UTC
Created attachment 529778 [details]
ifconfig vnet0 (20111024)

Comment 55 Ronen Hod 2012-07-23 07:10:07 UTC
Closing.
Not enough info. Maybe it is a setup issue.
Seems like the reporter lost interest.
We can reopen if the problem returns.