Bug 507391

Summary: qemu-kvm PXE boot with e1000 results in bogus packets
Product: Fedora
Reporter: Gilboa Davara <gilboad>
Component: etherboot
Assignee: Mark McLoughlin <markmc>
Status: CLOSED ERRATA
QA Contact: Fedora Extras Quality Assurance <extras-qa>
Severity: medium
Priority: high
Version: 11
CC: dwmw2, ehabkost, gcosta, itamar, jaswinder, kari.hautio, markmc, pcfe, virt-maint
Hardware: All   
OS: Linux   
Fixed In Version: 5.4.4-16.fc11
Doc Type: Bug Fix
Last Closed: 2009-07-02 05:41:58 UTC
Bug Blocks: 480594, 494832    
Attachments (description, flags):
- DSL VM configuration (flags: none)
- Private bridge configuration. (Bridge running in promisc mode) (flags: none)
- tap42 wireshark recording. (flags: none)

Description Gilboa Davara 2009-06-22 16:03:07 UTC
Created attachment 348934 [details]
DSL VM configuration

Description of problem:
I've upgraded my first KVM host to F11.
I'm trying to PXE boot DSL (Damn Small Linux).
This test works just fine under F9 and F10.

Version-Release number of selected component (if applicable):
qemu-0.10.4-4.fc11.x86_64

How reproducible:
Always

Steps to Reproduce:
1. Set up a private bridge. (Configuration attached.)
2. Set up an empty qemu VM. (Configuration attached.)
3. Boot.

Actual results:
Client fails to receive an IP address. Host sees invalid packets. (pcap attached)

Expected results:
The client receives an IP address and boots.

Comment 1 Gilboa Davara 2009-06-22 16:04:02 UTC
Created attachment 348935 [details]
Private bridge configuration. (Bridge running in promisc mode)

Comment 2 Gilboa Davara 2009-06-22 16:05:37 UTC
Created attachment 348936 [details]
tap42 wireshark recording.

Comment 3 Gilboa Davara 2009-06-22 16:31:18 UTC
P.S. DHCP works just fine once the OS actually boots.

Comment 4 Mark McLoughlin 2009-06-22 16:49:22 UTC
What version of etherboot is this? Does etherboot-5.4.4-15.fc11 help?

  https://admin.fedoraproject.org/updates/etherboot-5.4.4-15.fc11

I doubt it - those frames are pretty messed up. Does it work with e.g. rtl8139, virtio, ne2k_pci or pcnet?

Comment 5 Gilboa Davara 2009-06-22 17:56:57 UTC
Works just fine with rtl8139 with etherboot-5.4.4-13.
I'm still getting trashed 0xff frames with etherboot-5.4.4-15.

- Gilboa

Comment 6 Mark McLoughlin 2009-06-23 11:40:12 UTC
Okay, so the packet dump shows that the type field in the Ethernet header is (incorrectly) zero.
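
For anyone checking a capture by hand: in an Ethernet II frame the type field (EtherType) sits in bytes 12 and 13, most significant byte first, directly after the two 6-byte addresses. The following is a minimal sketch of reading it from raw frame bytes. `frame_ethertype` is a hypothetical helper name used only for illustration; it is not part of etherboot or wireshark.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6  /* length of an Ethernet MAC address in bytes */

/* Hypothetical helper: return the EtherType of a raw Ethernet II frame,
 * or 0 if the buffer is too short to hold the full 14-byte header. */
static uint16_t frame_ethertype(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < 2 * ETH_ALEN + 2)
        return 0;
    /* Bytes 12 and 13 hold the type in network byte order (big-endian). */
    return (uint16_t)((frame[2 * ETH_ALEN] << 8) | frame[2 * ETH_ALEN + 1]);
}
```

A healthy IPv4 frame, such as a DHCP request, would yield 0x0800 here; a frame whose type bytes were never written yields 0.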

Enabling debugging in etherboot-5.4.4/drivers/net/e1000.c made the problem go away, which was the first clue.

The code is as follows:

    struct eth_hdr {
        unsigned char dst_addr[ETH_ALEN];
        unsigned char src_addr[ETH_ALEN];
        unsigned short type;
    } hdr;
    ...
    hdr.type = htons (type);
    txhd = tx_base + tx_tail;
    tx_tail = (tx_tail + 1) % 8;
    ...
    txhd->buffer_addr = virt_to_bus (&hdr);
    ...
    E1000_WRITE_REG (&hw, TDT, tx_tail);

i.e. we're setting the type in the header on the stack, setting up a tx descriptor to point to the header on the stack, and then writing the new tail index to the device's TDT register, which tells the device to send the packet.
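
The sequence can be modelled in ordinary hosted C, as a rough sketch. Everything prefixed `fake_` is a hypothetical stand-in invented for illustration, and the descriptor is reduced to a single field. Note that in this model the "device" is a plain C function, so the compiler can see it read the header and will keep the stores in order; on real hardware the device reads the header by DMA, which the compiler cannot see.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define ETH_ALEN   6
#define TX_RING_SZ 8   /* matches the "% 8" in the driver excerpt */

struct eth_hdr {
    unsigned char  dst_addr[ETH_ALEN];
    unsigned char  src_addr[ETH_ALEN];
    unsigned short type;
};

/* Hypothetical, simplified descriptor: the real e1000 one has more fields. */
struct fake_tx_desc {
    const void *buffer_addr;
};

/* Hypothetical stand-in for the NIC: the ring, the next slot it will
 * consume, and the type field of the last header it "sent". */
struct fake_nic {
    struct fake_tx_desc ring[TX_RING_SZ];
    unsigned            head;
    unsigned short      last_type_sent;
};

/* Models a write to TDT: the device consumes every descriptor up to the
 * new tail and reads each header at that instant. */
static void fake_write_tdt(struct fake_nic *nic, unsigned new_tail)
{
    assert(new_tail < TX_RING_SZ);
    while (nic->head != new_tail) {
        struct eth_hdr seen;
        memcpy(&seen, nic->ring[nic->head].buffer_addr, sizeof seen);
        nic->last_type_sent = seen.type;
        nic->head = (nic->head + 1) % TX_RING_SZ;
    }
}

/* Same order as the driver excerpt: fill the header, point the descriptor
 * at it, advance the tail, then write the tail to the device. */
static unsigned fake_transmit(struct fake_nic *nic, unsigned tx_tail,
                              unsigned short wire_type)
{
    struct eth_hdr hdr;

    memset(&hdr, 0, sizeof hdr);
    hdr.type = wire_type;                   /* already in network order */
    nic->ring[tx_tail].buffer_addr = &hdr;
    tx_tail = (tx_tail + 1) % TX_RING_SZ;
    fake_write_tdt(nic, tx_tail);           /* header must be complete here */
    return tx_tail;
}
```

The point of the model is the last line of `fake_transmit`: whatever is in the header at the moment of the tail write is what goes on the wire.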

Looking at the assembly, I see:

     36d:       8b 4c 24 38             mov    0x38(%esp),%ecx
     371:       86 cd                   xchg   %cl,%ch
     ...
     3fb:       89 90 18 38 00 00       mov    %edx,0x3818(%eax)
     ...
     407:       66 89 4c 24 1e          mov    %cx,0x1e(%esp)

i.e. we don't actually move the result of the htons() into the header on the stack until after we've written the TDT register. At that point the packet has already been sent.

The problem is that the compiler has no way of knowing that the device reads this memory as a result of us writing to the register, so it is free to move the store after the register write. If we instead do:

-       struct eth_hdr {
+       volatile struct eth_hdr {

we see:

     36c:       8b 44 24 38             mov    0x38(%esp),%eax
     370:       86 c4                   xchg   %al,%ah
     372:       66 89 44 24 1e          mov    %ax,0x1e(%esp)
     ...
     400:       89 90 18 38 00 00       mov    %edx,0x3818(%eax)

The store of the type field now comes before the TDT write, and this fixes the problem.
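
For comparison only, and not what the fix here does: another common way to get the same ordering with GCC is a compiler barrier placed just before the register write. The sketch below assumes GCC or Clang (the empty `asm` with a "memory" clobber is a compiler extension) and assumes the header's address has already been stored in the descriptor, so the compiler must treat the header as reachable. `compiler_barrier`, `fake_tx_desc` and `fake_send` are hypothetical names, and the `tdt` pointer merely models the memory-mapped register.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_ALEN 6

/* GCC/Clang extension: emits no instructions, but the compiler may not
 * move memory accesses across it. It does not order accesses in the CPU. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

struct eth_hdr {
    unsigned char  dst_addr[ETH_ALEN];
    unsigned char  src_addr[ETH_ALEN];
    unsigned short type;
};

/* Hypothetical, simplified descriptor holding only the buffer address. */
struct fake_tx_desc {
    uintptr_t buffer_addr;
};

static void fake_send(struct fake_tx_desc *txhd, volatile uint32_t *tdt,
                      uint32_t new_tail, unsigned short wire_type)
{
    struct eth_hdr hdr = {{0}, {0}, 0};

    assert(txhd != NULL && tdt != NULL);
    hdr.type = wire_type;
    txhd->buffer_addr = (uintptr_t)&hdr;  /* the header's address escapes */

    compiler_barrier();  /* stores above should be emitted before this */
    *tdt = new_tail;     /* models E1000_WRITE_REG (&hw, TDT, tx_tail) */
}
```

Because the barrier only constrains the compiler, an architecture with weaker store ordering may additionally need a hardware write barrier at the same spot.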

Comment 7 Mark McLoughlin 2009-06-23 11:47:56 UTC
* Tue Jun 23 2009 Mark McLoughlin <markmc> - 5.4.4-16
- Fix e1000 PXE boot - caused by compiler optimization (bug #507391)

Comment 8 Mark McLoughlin 2009-06-23 11:57:41 UTC
*** Bug 494541 has been marked as a duplicate of this bug. ***

Comment 9 Fedora Update System 2009-06-23 11:59:01 UTC
etherboot-5.4.4-16.fc11 has been submitted as an update for Fedora 11.
http://admin.fedoraproject.org/updates/etherboot-5.4.4-16.fc11

Comment 11 Fedora Update System 2009-06-27 02:58:25 UTC
etherboot-5.4.4-16.fc11 has been pushed to the Fedora 11 testing repository. If problems still persist, please make note of it in this bug report.
If you want to test the update, you can install it with:

  su -c 'yum --enablerepo=updates-testing update etherboot'

You can provide feedback for this update here: http://admin.fedoraproject.org/updates/F11/FEDORA-2009-7024

Comment 12 Gilboa Davara 2009-06-29 14:32:14 UTC
etherboot-5.4.4-16.fc11.noarch seems to solve the problem.

- Gilboa

Comment 13 Kari Hautio 2009-06-30 11:19:22 UTC
etherboot-5.4.4-16.fc11 also works for me and solves the no-IP problem (bug #494541).

Comment 14 Mark McLoughlin 2009-06-30 13:22:58 UTC
Gilboa and Kari, thanks for testing - I'll push to stable now

Note that, in future, if you go to the update URL:

  https://admin.fedoraproject.org/updates/F11/FEDORA-2009-7024

you can log in and add a comment. This increases the update's 'karma'; if enough people comment, the update gets pushed automatically.

Comment 15 Gilboa Davara 2009-06-30 14:57:54 UTC
Thanks. Will do.

- Gilboa

Comment 16 Fedora Update System 2009-07-02 05:41:47 UTC
etherboot-5.4.4-16.fc11 has been pushed to the Fedora 11 stable repository.  If problems still persist, please make note of it in this bug report.