Bug 968829 - VM not PXE network booting correctly with gpxe-roms-qemu 0.9.7
Status: CLOSED INSUFFICIENT_DATA
Product: Red Hat Enterprise Linux 6
Classification: Red Hat
Component: gpxe
Version: 6.4
Hardware: Unspecified Unspecified
Priority: unspecified, Severity: unspecified
Target Milestone: rc
Target Release: ---
Assigned To: Alex Williamson
QA Contact: Virtualization Bugs
Reported: 2013-05-30 00:33 EDT by Ian Wienand
Modified: 2015-01-01 19:00 EST
Doc Type: Bug Fix
Last Closed: 2013-06-16 21:04:23 EDT
Type: Bug

Attachments
screenshot of the host not booting (13.54 KB, image/png)
2013-05-30 00:33 EDT, Ian Wienand
tcpdump of dhcp transaction for host that does not boot (2.54 KB, text/x-log)
2013-05-30 00:35 EDT, Ian Wienand
A working pxe boot (99.89 KB, text/x-log)
2013-06-03 19:17 EDT, Ian Wienand
Failing pxe boot (2.54 KB, text/x-log)
2013-06-03 19:18 EDT, Ian Wienand
diff between good and bad boot (3.79 KB, text/plain)
2013-06-03 19:24 EDT, Ian Wienand
wireshark log - no filename/next-server provided (2.39 KB, application/octet-stream)
2013-06-07 15:24 EDT, Alex Williamson
pxe boot successfully (17.99 KB, image/png)
2013-06-14 01:13 EDT, Qunfang Zhang

Description Ian Wienand 2013-05-30 00:33:44 EDT
Created attachment 754643 [details]
screenshot of the host not booting

Description of problem:

When trying to PXE boot a VM with the ROM from gpxe-roms-qemu 0.9.7, only

"no filename or root path specified"

is seen.

I have attached a screenshot of this, and a tcpdump of the dhcp transaction between the host and the affected VM which seems to show everything working as expected from the dhcp side.

I broke in with ctrl-b and set the next-server and filename fields manually, and the host did boot.

On a hunch I then replaced the ROMs in /usr/share/gpxe with those from gpxe-roms-qemu-1.0.1-6.fc17.noarch.rpm, and suddenly the VM started booting just fine.

The NIC is using virtio, but I tried switching between all the available types (rtl, e1000) and the same thing happened.
Comment 1 Ian Wienand 2013-05-30 00:35:52 EDT
Created attachment 754644 [details]
tcpdump of dhcp transaction for host that does not boot
Comment 3 Alex Williamson 2013-06-03 14:36:51 EDT
What dhcp server are you using and can you provide the config?
Comment 4 Alex Williamson 2013-06-03 16:22:48 EDT
If using dnsmasq, please check that you're using dhcp-no-override.  This is a minimal dnsmasq.conf that works for me:

dnsmasq.conf:
interface=br2
dhcp-range=10.0.0.100,10.0.0.254,255.255.255.0,10m
dhcp-option=3,10.0.0.1
dhcp-no-override
enable-tftp
tftp-root=/var/www/html/boot
dhcp-boot=pxelinux.0
no-daemon

/etc/sysconfig/network-scripts/ifcfg-br2 
DEVICE="br2"
ONBOOT="yes"
TYPE="Bridge"
BOOTPROTO="none"
NETMASK="255.255.255.0"
IPADDR="10.0.0.1"
DELAY="0"

# dnsmasq -C dnsmasq.conf

dnsmasq-2.48-13.el6.x86_64
gpxe-roms-qemu-0.9.7-6.9.el6.noarch

From the dnsmasq(8) man page:

       --dhcp-no-override
              Disable re-use of the DHCP servername and filename fields as
              extra option space. If it can, dnsmasq moves the boot server
              and filename information (from dhcp-boot) out of their
              dedicated fields into DHCP options. This make extra space
              available in the DHCP packet for options but can, rarely,
              confuse old or broken clients. This flag forces "simple and
              safe" behaviour to avoid problems in such a case.
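
To check which of the two encodings a captured reply actually uses, something like the following Python sketch may help. It assumes the standard BOOTP/DHCP layout from RFC 2131 and RFC 2132 (fixed `sname` and `file` fields, option 67 for the boot file name, option 52 for "overload"). The function names and the `build_reply` helper are made up for illustration; this is not code from gPXE or dnsmasq.

```python
MAGIC_COOKIE = b"\x63\x82\x53\x63"   # DHCP magic cookie
SNAME_OFFSET, SNAME_LEN = 44, 64     # fixed BOOTP "sname" field
FILE_OFFSET, FILE_LEN = 108, 128     # fixed BOOTP "file" field
OPTIONS_OFFSET = 240                 # 236-byte header + 4-byte cookie


def parse_options(data):
    """Parse a DHCP option area into {code: value}; stops at End (255)."""
    options = {}
    i = 0
    while i < len(data):
        code = data[i]
        if code == 255:      # End option
            break
        if code == 0:        # Pad option, single byte
            i += 1
            continue
        if i + 1 >= len(data):
            break            # truncated option, give up
        length = data[i + 1]
        options[code] = data[i + 2:i + 2 + length]
        i += 2 + length
    return options


def locate_boot_info(packet):
    """Return (filename, where_found) for a raw BOOTP/DHCP reply payload."""
    if packet[236:240] != MAGIC_COOKIE:
        raise ValueError("not a DHCP packet (magic cookie missing)")
    options = parse_options(packet[OPTIONS_OFFSET:])
    overload = (options.get(52) or b"\x00")[0]
    file_field = packet[FILE_OFFSET:FILE_OFFSET + FILE_LEN]
    sname_field = packet[SNAME_OFFSET:SNAME_OFFSET + SNAME_LEN]
    # With option 52 set, the fixed fields carry more options, not strings.
    if overload & 1:
        for code, value in parse_options(file_field).items():
            options.setdefault(code, value)
    if overload & 2:
        for code, value in parse_options(sname_field).items():
            options.setdefault(code, value)
    if 67 in options:
        return options[67].rstrip(b"\x00").decode("ascii"), "option 67"
    if not overload & 1:
        name = file_field.split(b"\x00", 1)[0]
        if name:
            return name.decode("ascii"), "file field"
    return None, "absent"


def build_reply(file_field=b"", options=b""):
    """Hypothetical helper: build a minimal reply payload for experiments."""
    header = bytearray(236)
    header[0] = 2  # BOOTREPLY
    header[FILE_OFFSET:FILE_OFFSET + len(file_field)] = file_field
    return bytes(header) + MAGIC_COOKIE + options + b"\xff"


classic = build_reply(file_field=b"pxelinux.0")
moved = build_reply(options=bytes([67, 10]) + b"pxelinux.0")
print(locate_boot_info(classic))  # ('pxelinux.0', 'file field')
print(locate_boot_info(moved))    # ('pxelinux.0', 'option 67')
```

A reply that reports "option 67" is the case dhcp-no-override is meant to prevent; a reply that reports "file field" is already in the "simple and safe" form.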
Comment 5 Ian Wienand 2013-06-03 19:14:21 EDT
(In reply to Alex Williamson from comment #3)
> What dhcp server are you using and can you provide the config?

It was using ISC dhcp

---
Name        : dhcp
Arch        : x86_64
Epoch       : 12
Version     : 4.1.1
Release     : 34.P1.el6
---

the config isn't anything special

---
subnet 10.16.16.0 netmask 255.255.252.0 {
    authoritative;
    option domain-name-servers 10.16.17.200;
    option domain-name "internal.oslab.priv oslab.priv mpc.lab.eng.bos.redhat.com";
    option routers 10.16.19.254;
    filename "/pxelinux.0";
    next-server 10.16.17.200;
    default-lease-time 21600;         
    max-lease-time 43200;
#    range 10.16.17.160 10.16.17.170;
}
---

Digging further: we're using Foreman to configure the host. Its proxy is writing directly to /var/lib/dhcp/dhcp.leases, where it has entries like

---
host grizzly.oslab.priv {
  dynamic;
  hardware ethernet 52:54:00:5c:53:99;
  fixed-address 10.16.17.205;
  supersede server.filename = "pxelinux.0";
  supersede server.next-server = 0a:10:11:c8;
  supersede host-name = "grizzly.oslab.priv";
}
---

for each host.
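
The next-server value in these entries is written as colon-separated hex bytes rather than a dotted quad. A small Python sketch for decoding it (the function name is made up for illustration) shows that it is the same address as the next-server in the subnet block:

```python
import ipaddress


def decode_hex_address(value):
    """Convert colon-separated hex bytes (e.g. '0a:10:11:c8') to a dotted quad."""
    raw = bytes(int(part, 16) for part in value.split(":"))
    return str(ipaddress.IPv4Address(raw))


print(decode_hex_address("0a:10:11:c8"))  # 10.16.17.200
```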

From the man page

  Disable re-use of the DHCP servername and filename fields as extra
  option space. If it can, dnsmasq moves the boot server and filename
  information (from dhcp-boot) out of their dedicated fields into DHCP
  options.

Looking at the packet trace, I don't think this is happening; the filename seems to be in the right place?

---
08:47:41.789070 IP (tos 0x10, ttl 128, id 0, offset 0, flags [none], proto UDP (17), length 380)
    unused.bootps > unused.bootpc: [udp sum ok] BOOTP/DHCP, Reply, length 352, xid 0x972f3b, Flags [none] (0x0000)
          Your-IP unused
          Server-IP unused
          Client-Ethernet-Address 52:54:00:97:2f:3b (oui Unknown)
          file "pxelinux.0"
          Vendor-rfc1048 Extensions
            Magic Cookie 0x63825363
...
---

To help a bit further, I captured traces from a "bad" boot and a "good" boot (with the later gpxe copied in)
Comment 6 Ian Wienand 2013-06-03 19:17:07 EDT
Created attachment 756571 [details]
A working pxe boot

Log from a host that PXE boots correctly with f17 gpxe binaries
Comment 7 Ian Wienand 2013-06-03 19:18:05 EDT
Created attachment 756572 [details]
Failing pxe boot

Same host as the working pxe boot, but using rhel6 gpxe binaries; fails to boot
Comment 8 Ian Wienand 2013-06-03 19:24:23 EDT
Created attachment 756589 [details]
diff between good and bad boot

diff of the initial dhcp discussion between a good and a bad boot
Comment 9 Alex Williamson 2013-06-04 00:39:25 EDT
I can't figure out how to make this fail.

dhcp-4.1.1-34.P1.el6.x86_64

# cat dhcpd.conf 
subnet 10.0.0.0 netmask 255.255.255.0 {
	authoritative;
	option domain-name-servers 192.168.1.1;
	option domain-name "home";
	option routers 10.0.0.1;
	filename "/pxelinux.0";
	next-server 10.0.0.1;
	default-lease-time 21600;         
	max-lease-time 43200;
}

host foo {
	dynamic;
	hardware ethernet 52:54:00:99:da:74;
	fixed-address 10.0.0.2;
	supersede server.filename = "pxelinux.0";
	supersede server.next-server = 0a:00:00:01;
	supersede host-name = "foo";
}


# dhcpd -d -cf ./dhcpd.conf

The only difference between your good and bad traces is the order of requested (supported?) options.  My trace has them in exactly the same order as your failing case, but works.
Comment 10 Ian Wienand 2013-06-04 00:59:34 EDT
just to confirm, the md5sums of the "bad" roms are below (rhel64)

# md5sum *
c0359d4e3b952500fdfa5503991eba3f  e1000-0x100e.rom
c7756e48fb0dc44ce7c8bd478819804b  ne.rom
84069f56a6cd94e7ed4fc0a75927856e  pcnet32.rom
14347b82c2ec8837ea8cb6237edfc145  rtl8029.rom
6cbffbf30e6b6bf952f3c713cbe9dc57  rtl8139.rom
a990f917d5136961218800969962e66f  virtio-net.rom

the "good" ones (from f17)

6341e9b36da39d35b0e8e7ed945089fa  8086100e.rom
94a1948fe28a3ee1e97c1b355bfea16b  ne.rom
8ffdf27e83347a1a252984cad0c427ef  pcnet32.rom
98723599a60e1fd62eb097c5bf4f0f75  rtl8029.rom
48ec17064a8eba61a39a2c72782b6e03  rtl8139.rom
7672904fcc6037181a647155e49e4987  virtio-net.rom

hopefully we're testing the same thing...
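
For anyone who wants to repeat the comparison between two ROM directories, here is a rough Python sketch. The function names are made up for illustration. It matches files by name, so the e1000 ROM, which is named differently in the two packages (e1000-0x100e.rom vs 8086100e.rom), would have to be compared by hand.

```python
import hashlib
from pathlib import Path


def md5_of_roms(directory):
    """Return {file name: md5 hex digest} for every *.rom file in directory."""
    sums = {}
    for path in sorted(Path(directory).glob("*.rom")):
        sums[path.name] = hashlib.md5(path.read_bytes()).hexdigest()
    return sums


def differing_roms(left, right):
    """Names present in both mappings whose checksums differ."""
    return sorted(n for n in left.keys() & right.keys() if left[n] != right[n])


# Two of the checksums quoted above, as a worked example.
rhel6 = {
    "ne.rom": "c7756e48fb0dc44ce7c8bd478819804b",
    "virtio-net.rom": "a990f917d5136961218800969962e66f",
}
f17 = {
    "ne.rom": "94a1948fe28a3ee1e97c1b355bfea16b",
    "virtio-net.rom": "7672904fcc6037181a647155e49e4987",
}
print(differing_roms(rhel6, f17))  # ['ne.rom', 'virtio-net.rom']
```

On a live host, `md5_of_roms("/usr/share/gpxe")` would produce the left-hand mapping from the installed ROMs.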
Comment 11 Alex Williamson 2013-06-04 01:15:04 EDT
Confirmed, I'm testing on an untainted rhel6.4 server with a guest created through virt manager, using virtio-net for a NIC.
Comment 12 Alex Williamson 2013-06-07 15:24:35 EDT
Created attachment 758302 [details]
wireshark log - no filename/next-server provided

Did the test environment change?  I logged in to debug and found that neither the old nor the new ROMs work.  I captured this log showing that my VM is never being offered a filename or next-server.
Comment 13 Ian Wienand 2013-06-11 02:16:11 EDT
Gilles; did anything change?
Comment 14 Gilles Dubreuil 2013-06-13 21:08:50 EDT
Yes, there is a new management (foreman) VM replacing the old one as I had to move on.

That said, I've saved the previous one. It's a 10G KVM/LVM image dump.

I've not tried that issue on the new environment. There are other priorities, so I don't know when I'll be able to test it again.

Meanwhile, since this is reproduce-able and identified I leave it with you Alex.

Cheers
Comment 15 Alex Williamson 2013-06-13 21:35:29 EDT
(In reply to Gilles Dubreuil from comment #14)
> Meanwhile, since this is reproduce-able and identified I leave it with you
> Alex.

Except it's not reproducible for me and we don't know what the problem is...
Comment 16 Qunfang Zhang 2013-06-14 01:11:18 EDT
Tested with the following packages and cannot reproduce the issue.
Host:
kernel-2.6.32-384.el6.x86_64
qemu-kvm-0.12.1.2-2.375.el6.x86_64
gpxe-bootimgs-0.9.7-6.9.el6.noarch
gpxe-roms-0.9.7-6.9.el6.noarch
gpxe-roms-qemu-0.9.7-6.9.el6.noarch

md5sum shows the same checksums as reporter Ian's "bad" ROMs:

[root@localhost gpxe]# md5sum e1000-0x100e.rom ne.rom pcnet32.rom rtl8029.rom rtl8139.rom virtio-net.rom
c0359d4e3b952500fdfa5503991eba3f  e1000-0x100e.rom
c7756e48fb0dc44ce7c8bd478819804b  ne.rom
84069f56a6cd94e7ed4fc0a75927856e  pcnet32.rom
14347b82c2ec8837ea8cb6237edfc145  rtl8029.rom
6cbffbf30e6b6bf952f3c713cbe9dc57  rtl8139.rom
a990f917d5136961218800969962e66f  virtio-net.rom

Steps:
Boot the guest and make sure the VM boots from the network first by appending "bootindex=1" to the network device parameter:

/usr/libexec/qemu-kvm \
    -cpu SandyBridge,check \
    -M rhel6.4.0 \
    -enable-kvm \
    -m 2G \
    -smp 2,sockets=2,cores=1,threads=1 \
    -name rhel6.4-64 \
    -uuid 9a0e67ec-f286-d8e7-0548-0c1c9ec93009 \
    -nodefconfig \
    -nodefaults \
    -monitor stdio \
    -rtc base=utc,clock=host,driftfix=slew \
    -no-kvm-pit-reinjection \
    -no-shutdown \
    -device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 \
    -device virtio-serial-pci,id=virtio-serial0,bus=pci.0,addr=0x7 \
    -drive file=/home/test/RHEL-Server-6.4-64-virtio.qcow2,if=none,id=drive-virtio-disk0,format=qcow2,cache=none \
    -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x5,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=2 \
    -drive if=none,media=cdrom,id=drive-ide0-1-0,readonly=on,format=raw \
    -device ide-drive,bus=ide.1,unit=0,drive=drive-ide0-1-0,id=ide0-1-0 \
    -netdev tap,id=hostnet0,vhost=on \
    -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:d5:51:8a,bus=pci.0,addr=0x3,bootindex=1 \
    -chardev pty,id=charserial0 \
    -device isa-serial,chardev=charserial0,id=serial0 \
    -device usb-tablet,id=input0 \
    -vnc :10 \
    -vga std \
    -device intel-hda,id=sound0,bus=pci.0,addr=0x4 \
    -device hda-duplex,id=sound0-codec0,bus=sound0.0,cad=0 \
    -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x6 \
    -qmp tcp:0:5555,server,nowait \
    -global PIIX4_PM.disable_s3=0 \
    -global PIIX4_PM.disable_s4=0

Result:
Guest can go into pxe menu successfully. (Attachment will be uploaded)

Hi, Ian
I'm a virt QE and trying to reproduce this. Could you give me some more info and detailed steps to reproduce this problem?

Thanks,
Qunfang
Comment 17 Qunfang Zhang 2013-06-14 01:13:27 EDT
Created attachment 761079 [details]
pxe boot successfully
Comment 18 Ian Wienand 2013-06-16 21:02:43 EDT
I can't repro this either now :(

We are using a different qemu-kvm version from the one in comment #16.  However, it seems that on the host in question we have updated from el6_4.2 -> el6_4.3 while this bug has been open.

---
    Install qemu-kvm-2:0.12.1.2-2.355.el6_4.2.x86_64 @rhel-6-server-rpms
    Updated qemu-kvm-2:0.12.1.2-2.355.el6_4.2.x86_64 @rhel-6-server-rpms
    Updated qemu-kvm-2:0.12.1.2-2.355.el6_4.3.x86_64 @rhel-6-server-rpms
---

The changelog for that package update doesn't give any smoking guns...

I'm not sure if it's worth pursuing this any further.  Hopefully if anyone else sees this bug it has enough info to narrow things down further.
Comment 19 Paul Armstrong 2015-01-01 19:00:25 EST
Note:
Running a lab router for DHCP with dnsmasq, I ran into the same issue: the client picks up next-server but misses the filename.

It boots fine after pressing Ctrl+B and entering:
set filename /pxelinux.0

RHEVM 3.4.2-0.2.el6ev
Sat 6.0.4

editing dnsmasq.conf to include dhcp-no-override solved the issue for me.
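
A quick way to check whether a given dnsmasq.conf already enables the option is sketched below in Python. The function name is made up for illustration, and the parsing is deliberately naive: it only assumes that options appear one per line and that "#" starts a comment. The sample text is the minimal config from comment 4.

```python
def has_option(config_text, option):
    """True if a dnsmasq-style config contains the bare option on its own line."""
    for line in config_text.splitlines():
        stripped = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if stripped == option:
            return True
    return False


sample = """\
interface=br2
dhcp-range=10.0.0.100,10.0.0.254,255.255.255.0,10m
dhcp-option=3,10.0.0.1
dhcp-no-override
enable-tftp
tftp-root=/var/www/html/boot
dhcp-boot=pxelinux.0
no-daemon
"""
print(has_option(sample, "dhcp-no-override"))  # True
```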
