894296 – UEFI PXE boot fails

Keywords:
Status:	CLOSED NOTABUG
Alias:	None
Product:	Fedora
Classification:	Fedora
Component:	grub2
Sub Component:
Version:	18
Hardware:	Unspecified
OS:	Unspecified
Priority:	unspecified
Severity:	high
Target Milestone:	---
Assignee:	Peter Jones
QA Contact:	Fedora Extras Quality Assurance
Docs Contact:
URL:
Whiteboard:
Depends On:
Blocks:	901442

Reported:	2013-01-11 11:04 UTC by Lingzhu Xiang
Modified:	2013-07-04 20:56 UTC
CC List:	5 users
Fixed In Version:
Clone Of:
Clones:	901442
Environment:
Last Closed:	2013-02-14 14:57:52 UTC
Type:	Bug
Embargoed:
Dependent Products:

Attachments
wireshark-capture.pcap (12.10 KB, application/vnd.tcpdump.pcap) 2013-01-11 11:04 UTC, Lingzhu Xiang

Description Lingzhu Xiang 2013-01-11 11:04:47 UTC

Created attachment 676801
wireshark-capture.pcap

Description of problem:

UEFI PXE boot fails and falls back to command line.

The packet capture shows that the client successfully loaded grub.cfg and a few other files. It then sent two malformed packets of 60 zero bytes each, after which it sent no more packets. The capture also contains ICMPv6 messages such as router solicitations and listener reports.

Version-Release number of selected component (if applicable):
grub2-efi-2.00-15.fc18 with tftp module built in.

How reproducible:
Always. Reproduced on a Dell XPS 8500 (firmware: EFI v2.31, likely an EDK2 mod) capable of IPv4/IPv6 PXE.

Steps to Reproduce:
1. Configure DHCP and TFTP server
2. Boot
  
Actual results:
UEFI PXE boot fails and falls back to command line.

Expected results:
It boots into a menu.

Additional info:
The attached wireshark-capture.pcap contains all packets on the wire (except those with tftp.block > 1).

I've been quite curious about how the zero-byte packets were generated.

Adding random link delay with netem does not change the number or order of the malformed packets.

The malformed packets are sent after certain TFTP requests in a deterministic manner.

I added a printf in net/drivers/efi/efinet.c:

static grub_err_t
send_card_buffer (struct grub_net_card *dev,
                  struct grub_net_buff *pack)
{
...
  if (dev->txbusy)
    while (1)
      {
        void *txbuf = NULL;
        st = efi_call_3 (net->get_status, net, 0, &txbuf);
        if (st != GRUB_EFI_SUCCESS)
          return grub_error (GRUB_ERR_IO,
                             N_("couldn't send network packet"));
        if (txbuf == dev->txbuf)
          {
            dev->txbusy = 0;
            break;
          }
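        /* Debug output added for this report: GetStatus returned a
           recycled buffer that is not the one we are waiting for.  */
        if (txbuf)
          grub_printf ("not my txbuf: txbuf=%p dev->txbuf=%p\n", txbuf, dev->txbuf);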
        if (limit_time < grub_get_time_ms ())
          return grub_error (GRUB_ERR_TIMEOUT, N_("couldn't send network packet"));
      }

I got "not my txbuf: txbuf=0xd237a698 dev->txbuf=0xce9b6160" before falling back to the command line. I haven't seen a txbuf value other than 0xd237a698, but dev->txbuf sometimes changes.

My guess is that the firmware's IPv6 stack, which sends the ICMPv6 messages, gets into a race with the GetStatus calls made by GRUB's efinet. GetStatus in SNP removes the returned txbuf from the "transmitted buffer queue" and thereby indicates that the txbuf has finished transmission. The "not my txbuf" message probably shows that GRUB stole the txbuf of an ICMPv6 router solicitation and could no longer get its own txbuf back. Hence the network stalls.

GRUB's efinet uses SNP, which only allows exclusive access by one application, while the firmware's (EDK2) IPv6 stack uses MNP, an abstraction over SNP that allows concurrent operation. This looks quite problematic.

But I still have no idea what causes the zero-byte packets.

Comment 1 Peter Jones 2013-02-14 14:57:52 UTC

There's nothing in the spec that says you can't use SNP when the firmware is using MNP - and in fact there are specific APIs in MNP to tell when this is happening!  There's also no requirement that MNP is used for IPv6 - there's explicit support for it in SNP.

But even aside from that, the firmware should not be filling the network with garbage data.  There is no case where that's not a firmware bug.

With that in mind, I'm closing this.
