Bug 918016 - PXE Booting no longer works with kernel 3.8.1-201.fc18
Summary: PXE Booting no longer works with kernel 3.8.1-201.fc18
Keywords:
Status: CLOSED WORKSFORME
Alias: None
Product: Fedora
Classification: Fedora
Component: kernel
Version: 18
Hardware: x86_64
OS: Linux
Priority: unspecified
Severity: high
Target Milestone: ---
Assignee: Neil Horman
QA Contact: Fedora Extras Quality Assurance
URL:
Whiteboard:
Depends On:
Blocks:
 
Reported: 2013-03-05 10:12 UTC by Richard
Modified: 2013-05-30 11:36 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2013-05-10 13:15:18 UTC
Type: Bug
Embargoed:


Attachments:
tftp boot default file (219 bytes, text/plain)
2013-03-11 14:24 UTC, Richard
tcpdump of the broken kernel setup. (983.66 KB, application/octet-stream)
2013-03-11 14:26 UTC, Richard
tcpdump of the working kernel setup. (12.98 MB, application/octet-stream)
2013-03-11 14:27 UTC, Richard

Description Richard 2013-03-05 10:12:58 UTC
Description of problem: PXE booting a LiveCD image no longer works after upgrading the kernel to 3.8.1-201.fc18. The last good kernel I can install is 3.6.10-4.
Rebooting into an older kernel and trying the exact same configuration works just fine.

Version-Release number of selected component (if applicable):
dnsmasq-2.65-4.fc18.x86_64
kernel-3.8.1-201.fc18.x86_64
kernel-3.6.10-4.fc18.x86_64


How reproducible:
Set up dnsmasq as a tftp/dhcp server.
Try to boot from any LiveCD image.

Steps to Reproduce:
1. set up a pxe boot server
2. boot a client via pxe
3. watch the server logs for:

Mar  4 22:04:16 recon dnsmasq-tftp[4855]: failed sending /var/lib/tftpboot/pxelinux.0 to 169.254.132.138
  
Actual results: pxe image isn't sent to client and client errors

Expected results: pxe client to boot.

Comment 1 Tomáš Hozza 2013-03-06 12:13:33 UTC
Hi.

I'm not able to reproduce your issue.

PXE server:
-----------
# dnsmasq -d --enable-tftp --tftp-root=/tftproot/tftpboot --dhcp-option=66,"192.168.133.1" --conf-file= --except-interface lo --bind-dynamic --interface eth1 --dhcp-range 192.168.133.128,192.168.133.254 --dhcp-leasefile=/tmp/hosts.leases --dhcp-lease-max=127 --dhcp-no-override --dhcp-boot=pxelinux.0
dnsmasq: started, version 2.65 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack
dnsmasq-dhcp: DHCP, IP range 192.168.133.128 -- 192.168.133.254, lease time 1h
dnsmasq-tftp: TFTP root is /tftproot/tftpboot 
dnsmasq: reading /etc/resolv.conf
dnsmasq: using nameserver 192.168.122.1#53
dnsmasq: read /etc/hosts - 2 addresses
dnsmasq-dhcp: DHCPDISCOVER(eth1) 52:54:00:c2:3f:ae 
dnsmasq-dhcp: DHCPOFFER(eth1) 192.168.133.169 52:54:00:c2:3f:ae 
dnsmasq-dhcp: DHCPDISCOVER(eth1) 52:54:00:c2:3f:ae 
dnsmasq-dhcp: DHCPOFFER(eth1) 192.168.133.169 52:54:00:c2:3f:ae 
dnsmasq-dhcp: DHCPREQUEST(eth1) 192.168.133.169 52:54:00:c2:3f:ae 
dnsmasq-dhcp: DHCPACK(eth1) 192.168.133.169 52:54:00:c2:3f:ae 
dnsmasq-tftp: sent /tftproot/tftpboot/pxelinux.0 to 192.168.133.169
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/102c331e-9486-b2da-d1cf-c39c68fb85d3 not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/01-52-54-00-c2-3f-ae not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0A885A9 not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0A885A not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0A885 not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0A88 not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0A8 not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0A not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C0 not found
dnsmasq-tftp: file /tftproot/tftpboot/pxelinux.cfg/C not found
dnsmasq-tftp: sent /tftproot/tftpboot/pxelinux.cfg/default to 192.168.133.169
dnsmasq-tftp: sent /tftproot/tftpboot/vesamenu.c32 to 192.168.133.169
dnsmasq-tftp: sent /tftproot/tftpboot/pxelinux.cfg/default to 192.168.133.169
dnsmasq-tftp: sent /tftproot/tftpboot/vmlinuz to 192.168.133.169
dnsmasq-tftp: sent /tftproot/tftpboot/initrd.img to 192.168.133.169
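
The run of "not found" lines in the log above is normal: pxelinux probes a fixed sequence of config file names before falling back to "default". The following is a minimal sketch of that sequence as it appears in this log (client UUID, then the MAC address prefixed with the ARP hardware type 01, then the client IP in upper-case hex shortened one digit at a time). The function name pxelinux_config_candidates is made up for illustration and is not part of pxelinux or dnsmasq.

```python
import ipaddress


def pxelinux_config_candidates(client_uuid, mac, ip):
    """Return the config file names a pxelinux client probes, in order.

    Hypothetical helper that reproduces the sequence seen in the dnsmasq log.
    """
    names = []
    if client_uuid:
        # The client UUID, as reported by the PXE firmware, comes first.
        names.append(client_uuid.lower())
    # Hardware type 01 (Ethernet), then the MAC in lower case with dashes.
    names.append("01-" + mac.lower().replace(":", "-"))
    # The IPv4 address as eight upper-case hex digits, then shortened
    # one digit at a time from the right.
    hex_ip = "%08X" % int(ipaddress.IPv4Address(ip))
    for length in range(len(hex_ip), 0, -1):
        names.append(hex_ip[:length])
    # Last resort.
    names.append("default")
    return names


if __name__ == "__main__":
    for name in pxelinux_config_candidates(
        "102c331e-9486-b2da-d1cf-c39c68fb85d3",
        "52:54:00:c2:3f:ae",
        "192.168.133.169",
    ):
        print("pxelinux.cfg/" + name)
```

Run against the client in this log, it lists the same names that dnsmasq reports, which is a quick way to check that a per-client config file is named correctly.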

# rpm -qi dnsmasq
Name        : dnsmasq
Version     : 2.65
Release     : 4.fc18
Architecture: x86_64
...

# uname -r
3.8.1-201.fc18.x86_64

Can you please attach dnsmasq configuration (options) you are using?
Also please attach network communication dump if possible.

Thank you!

Comment 2 Richard 2013-03-06 16:15:43 UTC
Hi,

I get that far too but the boot image doesn't seem to be being sent over the wire. The PXE client then times out and doesn't boot into the OS.

My net is as follows:

wlan0: dhcp IP of 192.168.1.253/255.255.255.0
eth0: static IP of 169.254.0.1/255.255.0.0

dnsmasq started with a modified copy of your suggestion:

dnsmasq -d --enable-tftp --tftp-root=/var/lib/tftpboot \
--dhcp-option=66,"169.254.0.1" \
--conf-file= --except-interface lo \
--bind-dynamic --interface eth0 --dhcp-range 169.254.0.2,169.254.255.254 --dhcp-leasefile=/tmp/hosts.leases \
--dhcp-lease-max=127 --dhcp-no-override --dhcp-boot=pxelinux.0

and this is the result:

dnsmasq: started, version 2.65 cachesize 150
dnsmasq: compile time options: IPv6 GNU-getopt DBus no-i18n no-IDN DHCP DHCPv6 no-Lua TFTP no-conntrack
dnsmasq-dhcp: DHCP, IP range 169.254.0.2 -- 169.254.255.254, lease time 1h
dnsmasq-tftp: TFTP root is /var/lib/tftpboot
dnsmasq: reading /etc/resolv.conf
dnsmasq: using nameserver 192.168.1.254#53
dnsmasq: read /etc/hosts - 5 addresses
dnsmasq-dhcp: DHCPDISCOVER(eth0) 08:00:27:bb:32:40
dnsmasq-dhcp: DHCPOFFER(eth0) 169.254.199.211 08:00:27:bb:32:40
dnsmasq-dhcp: DHCPREQUEST(eth0) 169.254.199.211 08:00:27:bb:32:40
dnsmasq-dhcp: DHCPACK(eth0) 169.254.199.211 08:00:27:bb:32:40
dnsmasq-tftp: error 0 TFTP Aborted received from 169.254.199.211
dnsmasq-tftp: failed sending /var/lib/tftpboot/pxelinux.0 to 169.254.199.211
dnsmasq-tftp: sent /var/lib/tftpboot/pxelinux.0 to 169.254.199.211
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/78f2b504-2ae2-4d75-b1a3-103e48e325ac not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/01-08-00-27-bb-32-40 not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9FEC7D3 not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9FEC7D not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9FEC7 not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9FEC not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9FE not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9F not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A9 not found
dnsmasq-tftp: file /var/lib/tftpboot/pxelinux.cfg/A not found
dnsmasq-tftp: sent /var/lib/tftpboot/pxelinux.cfg/default to 169.254.199.211
dnsmasq-tftp: failed sending /var/lib/tftpboot/vmlinuz0 to 169.254.199.211

As you can see, it partly works, as the PXE client sends back the UUID (78f2b504-2ae2-4d75-b1a3-103e48e325ac).

Rich

Comment 3 Tomáš Hozza 2013-03-07 09:53:00 UTC
In my case the client starts booting without any problem.

I'm concerned by the following line in your dnsmasq output:
> dnsmasq-tftp: error 0 TFTP Aborted received from 169.254.199.211

It looks more like your client is doing something wrong.

Can you please attach network communication dump between PXE server
and PXE client?

Thank you

Comment 4 Richard 2013-03-07 10:28:02 UTC
> I'm concerned by the following line in your dnsmasq output:
> > dnsmasq-tftp: error 0 TFTP Aborted received from 169.254.199.211

This is reported with the older kernels too.
 
> It looks more like your client is doing something wrong.

That may be so, but it still boots with older kernels.
The only change I've made is booting the PXE server into a newer kernel.
Everything else is fine.
If I reboot into the older kernel it all works just fine.

> Can you please attach network communication dump between PXE server
> and PXE client?

I will see if I can create one this weekend.

Thanks,
Rich

Comment 5 Tomáš Hozza 2013-03-07 14:06:22 UTC
(In reply to comment #4)
> > I'm concerned by the following line in your dnsmasq output:
> > > dnsmasq-tftp: error 0 TFTP Aborted received from 169.254.199.211
> 
> This is reported with the older kernels too.
>  
> > It looks more like your client is doing something wrong.
> 
> That may be so, but it still boots with older kernels.
> The only change I've made is booting the PXE server into a newer kernel.
> Everything else is fine. If I reboot into the older kernel it all works just fine.
> 
> > Can you please attach network communication dump between PXE server
> > and PXE client?
> 
> I will see if I can create one this weekend.

Great. Can you please also attach your "default" file from "pxelinux.cfg"
directory in TFTP server root?

Thanks!

Comment 6 Richard 2013-03-11 14:24:28 UTC
Created attachment 708418 [details]
tftp boot default file

Attached is the default pxe boot config file created by livecd-iso-to-pxeboot

Comment 7 Richard 2013-03-11 14:26:04 UTC
Created attachment 708421 [details]
tcpdump of the broken kernel setup.

This is a tcpdump capture of the boot process with the broken kernel.

Comment 8 Richard 2013-03-11 14:27:40 UTC
Created attachment 708422 [details]
tcpdump of the working kernel setup.

This is a tcpdump capture with the older working kernel.
The only difference in system setup between the two is which kernel is running.

Comment 9 Richard 2013-03-11 14:30:33 UTC
I have attached the required tcp dumps.

This all works with the following:

VirtualBox Host Only Adapter

Both my HP Laptop and my Lenovo Laptop work using eth0 on the PXE server and a cross over cable between the two.

And none of the above devices boot when the PXE server is booted with the newer  kernel.

I have just tried this weekend's released kernel (3.8.2-206.fc18.x86_64) and it's still the same ;-(

Comment 10 Tomáš Hozza 2013-03-15 16:42:46 UTC
(In reply to comment #9)
> I have attached the required tcp dumps.
> 
> This all works with the following:
> 
> VirtualBox Host Only Adapter
> 
> Both my HP Laptop and my Lenovo Laptop work using eth0 on the PXE server and
> a cross over cable between the two.
> 
> And none of the above devices boot when the PXE server is booted with the
> newer  kernel.
> 
> I have just tried this weekend's released kernel (3.8.2-206.fc18.x86_64) and
> it's still the same ;-(

So far I have no idea where the problem might be. I think it is either the
kernel or some configuration issue on the PXE clients. Could you install
dnsmasq [1] with extra debugging output and run it on the "bad" kernel? Then
please add the output here as a comment or attachment.

Thank you

[1] http://koji.fedoraproject.org/koji/taskinfo?taskID=5127230

Comment 11 Richard 2013-03-15 22:20:59 UTC
Hi,

Do I need to do anything different from what I did in Comment #2 when running this new version to get the extra debugging info? Otherwise all I get is the same as in Comment #2.

I think we can assume from this testing that it's defo a kernel issue, as my only current workaround is to use a specific (older) kernel version, and if I boot into a newer kernel it all stops working.

If there's anything else I can do to help, please shout.

Rich

Comment 12 Tomáš Hozza 2013-03-16 10:36:20 UTC
(In reply to comment #11)
> Hi,
> 
> Do I need to do anything different from what I did in Comment #2 when running
> this new version to get the extra debugging info? Otherwise all I get is the
> same as in Comment #2.

All you need is to install the build I provided in comment #10 and run dnsmasq
with exactly the same options as in comment #2. You should see some extra lines
in dnsmasq output. If you don't see anything different than in comment #2 please
check also /var/log/messages.

> I think we can assume from this testing that it's defo a kernel issue, as my
> only current workaround is to use a specific (older) kernel version, and if
> I boot into a newer kernel it all stops working.

I will most probably change the component to kernel, but want to check
if the cause is in dnsmasq since this issue is very strange.

Thanks!

Comment 13 Richard 2013-03-18 11:28:42 UTC
I installed the RPMS from here:
http://koji.fedoraproject.org/koji/buildinfo?buildID=402775
But it gave no additional output.

Comment 14 Tomáš Hozza 2013-03-18 11:40:45 UTC
Changing the component to Kernel.

Comment 15 Richard 2013-03-20 10:15:51 UTC
I've just tried this with kernel-3.8.3-203.fc18.x86_64 and tftp-server-5.2-6.fc18.x86_64 and it still fails on 3.8.x kernels, so it seems to be a kernel thing and not a pxe/tftp configuration issue.

Comment 16 Richard 2013-04-01 12:06:22 UTC
Still a problem on 3.8.4-202.fc18

Comment 17 Richard 2013-04-14 15:15:37 UTC
Still a problem on 3.8.6-203.fc18

Comment 18 Richard 2013-04-16 14:44:56 UTC
Still a problem on 3.8.7-201.fc18.x86_64

Comment 19 Josh Boyer 2013-04-16 14:53:32 UTC
When you say "on 3.8.x", do you mean the kernel running on the tftp server machine is 3.8.x?

You mentioned virtual box in comment #9.  Are you trying to tftp a kernel in a virtual box guest?

If you could explain a bit more about your setup and exactly what is running where, that would be helpful.

Tomas, can you elaborate on why you think this is a kernel problem?  Do you think this is some kind of ethernet driver issue corrupting packets, or?

Comment 20 Richard 2013-04-17 10:50:51 UTC
Hi,

The tftp server has issues sending PXE images out when it's running kernel 3.8.x or above. kernel-3.7.9-205.fc18.x86_64 and below were ok.

Basically I have a server set up with both VirtualBox (on vboxnet0) and the motherboard's eth0 used for PXE booting. eth0 is an AR8131 Gigabit Ethernet nic and vboxnet0 is the standard VirtualBox interface.

I was initially using VirtualBox only and that stopped working, so I set up eth0 and used an HP laptop and a Lenovo laptop to test PXE booting with. Neither of these works with the 3.8.x kernels, but both are fine with the 3.7 and 3.6 kernels.

I believe this is a kernel issue as the only change I need to make to my system to get things to work is to reboot into a kernel lower than 3.8.x, i.e. 3.6.10-4.fc18.x86_64. I have tried both tftp-server-5.2-6.fc18.x86_64 and dnsmasq-2.65-5.fc18.x86_64 for sending tftp files and neither works with the 3.8.x kernels on the server side, but both are ok with the 3.6.x kernels.

Thanks,

Rich

Comment 21 Richard 2013-04-17 10:52:53 UTC
oh, and yes, I was booting a virtualbox guest VM via PXE from the tftp server running on the same host node.

Comment 22 Josh Boyer 2013-04-17 19:15:44 UTC
(In reply to comment #21)
> oh, and yes, I was booting a virtualbox guest VM via PXE from the tftp
> server running on the same host node.

Is that what you are always trying to boot?  Does PXE booting of an actual machine work?

Comment 23 Richard 2013-04-18 08:46:34 UTC
No, I test PXE booting with both the laptops and also VMs.

When it doesn't work, neither physical machines connected via eth0 nor VMs connected on vboxnet0 boot with a 3.8.x kernel.

When using any kernel lower than 3.8.x, both physical machines and VMs boot ok.

Comment 24 Josh Boyer 2013-04-18 12:42:00 UTC
Are the vbox modules still loaded on the server machines?  If so, could you try this without loading any 3rd party modules?

Nobody else has reported issues with PXE booting on 3.8.x and we're at a loss as to why you would be the only person seeing this.

Comment 25 Neil Horman 2013-04-19 19:30:46 UTC
FWIW, looking at the tcpdump, it appears that the tftp server simply stops sending frames after it gets the ACK for block 634 (out of 2000-some-odd).  If this were a kernel problem, I would expect other network services to stop working too - i.e. you wouldn't be able to ping the server from other operational systems.  But it doesn't sound like that's the case (please correct me if I'm wrong).  I would suggest disabling the tftp service in dnsmasq and installing an alternate tftp server (I use the tftp-server package, which provides tftpd).  If that daemon can serve files while the server is running the newer kernels, I think we can conclude that it's actually a dnsmasq-based problem you're looking at here.
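
For anyone reading the attached captures by hand, the packets mentioned in this bug (the ACK for block 634, and the "error 0 TFTP Aborted" from comment 2) are easy to pick apart, because every TFTP packet starts with a 2-byte opcode in network byte order. Below is a minimal sketch of a decoder for the UDP payload, based on the packet layouts in RFC 1350 and the option extension in RFC 2347. The name decode_tftp is made up for illustration; it is not taken from dnsmasq or tftp-server.

```python
import struct

# Opcodes from RFC 1350, plus OACK from RFC 2347.
OPCODES = {1: "RRQ", 2: "WRQ", 3: "DATA", 4: "ACK", 5: "ERROR", 6: "OACK"}


def decode_tftp(payload):
    """Decode the UDP payload of one TFTP packet into a small dict."""
    (opcode,) = struct.unpack("!H", payload[:2])
    info = {"op": OPCODES.get(opcode, "UNKNOWN(%d)" % opcode)}
    if opcode in (1, 2):
        # Request: NUL-terminated filename and mode, then optional
        # NUL-terminated option name/value pairs.
        fields = payload[2:].split(b"\x00")
        if fields and fields[-1] == b"":
            fields.pop()  # drop the empty piece after the final NUL
        fields = [f.decode("ascii", "replace") for f in fields]
        info["filename"] = fields[0] if fields else ""
        info["mode"] = fields[1] if len(fields) > 1 else ""
        info["options"] = dict(zip(fields[2::2], fields[3::2]))
    elif opcode == 3:
        # DATA: block number, then up to one block of file contents.
        (info["block"],) = struct.unpack("!H", payload[2:4])
        info["length"] = len(payload) - 4
    elif opcode == 4:
        # ACK: just the block number being acknowledged.
        (info["block"],) = struct.unpack("!H", payload[2:4])
    elif opcode == 5:
        # ERROR: numeric code, then a NUL-terminated message.
        (info["code"],) = struct.unpack("!H", payload[2:4])
        info["message"] = payload[4:].split(b"\x00", 1)[0].decode("ascii", "replace")
    return info


if __name__ == "__main__":
    print(decode_tftp(struct.pack("!HH", 4, 634)))
    print(decode_tftp(struct.pack("!HH", 5, 0) + b"TFTP Aborted\x00"))
```

In a healthy transfer each ACK for block N is followed by a DATA packet for block N+1 from the server, so the point where that pattern breaks is the point to look at.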

Comment 26 Richard 2013-04-20 12:00:18 UTC
I've already tried tftp-server (see comment 15). And tftp is the only network service that stops working with the 3.8.x kernels. SSH etc. is all ok. The only thing I do to get the system working again is reboot into a 3.6.x or a 3.7.x kernel. tftp-server AND dnsmasq both stop working with the newer kernels and are both ok with the older ones.

Comment 27 Neil Horman 2013-04-22 16:58:45 UTC
Ok, that's good information, thank you.  That said, it would point to either the application getting confused and not sending on the connection (unlikely, since it happens on two separate connections), or frames getting dropped in the kernel.  To track this down I think we need to capture the following during a failed tftp:

1) strace output from the tftp server process
2) Stats output from before and after the tftp operation (cat /proc/net/snmp, ethtool -S <ifname>)

3) Use of dropwatch would also be a great help in pinpointing this issue (please let me know if you need usage instructions)

4) /var/log/messages taken from after the failed tftp may also provide clues.
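
For item 2, comparing two copies of /proc/net/snmp by eye is tedious, so here is a rough sketch of a helper that parses the file and reports only the counters that moved. It relies on the file's usual layout (pairs of lines sharing a "Proto:" prefix, names first, values second). The function names are made up for illustration, and this does not cover the ethtool -S output.

```python
def parse_proc_net_snmp(text):
    """Parse the contents of /proc/net/snmp into {"Udp.InErrors": 0, ...}."""
    counters = {}
    lines = [line for line in text.splitlines() if line.strip()]
    # The file comes in pairs: a line of counter names, then a line of
    # values, both starting with the same "Proto:" prefix.
    for header, values in zip(lines[0::2], lines[1::2]):
        proto, names = header.split(":", 1)
        value_proto, numbers = values.split(":", 1)
        if proto != value_proto:
            raise ValueError("mismatched lines: %r / %r" % (proto, value_proto))
        for name, number in zip(names.split(), numbers.split()):
            counters[proto + "." + name] = int(number)
    return counters


def changed_counters(before, after):
    """Return {name: delta} for every counter whose value changed."""
    return {
        name: value - before.get(name, 0)
        for name, value in after.items()
        if value != before.get(name, 0)
    }


def read_snapshot(path="/proc/net/snmp"):
    """Read and parse the live counters (Linux only)."""
    with open(path) as handle:
        return parse_proc_net_snmp(handle.read())
```

The idea is to call read_snapshot() once before the PXE client is powered on and once after the transfer fails, then pass both results to changed_counters(). Counters such as the Udp error fields would be the first place to look, though which ones matter here is a guess until the data is in.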


Thanks!

Comment 28 Neil Horman 2013-05-08 20:05:50 UTC
ping, any feedback here?

Comment 29 Richard 2013-05-09 18:34:35 UTC
Hi,

Sorry for the delay... I gave up on the physical host with issues and built a new guest VM in VirtualBox (on the dodgy host) and set that up as a PXE server. This now works for PXE booting VM guest nodes.

I believe my issue came from something getting mixed up when upgrading my FC17 box to FC18 on the fly via fedup, and as my VMs work I'm happy for this to be closed as "Won't fix" or "Invalid", as it seems it was my setup that was the issue and not FC18.

However, if we want to follow this up "just for fun" I'll happily try and collect the info required.

Thanks,
Rich

Comment 30 Neil Horman 2013-05-10 13:15:18 UTC
I would like to follow up, unfortunately, I don't have time.  If the problem recurs however, please reopen this bug, and we can pick it up again.  Thanks!

Comment 31 Richard 2013-05-30 11:36:20 UTC
I've not found the fix, but I've found the cause of my PXE boot issues breaking with recent kernel releases :-)

Running both keepalived and dnsmasq as a tftp server on the same box doesn't work nicely...

keepalived manages a public VIP that moves between two servers depending on which one is master. When one of the two machines is made master, dnsmasq is started to provide PXE boot services.

Anyway, without keepalived running, I am able to PXE boot :-)

Does anyone know of a nice alternative to Keepalived?

