Bug 1802123

Summary: ipxe corrupts large initramfs
Product: Red Hat Enterprise Linux 7 Reporter: Grzegorz Halat <ghalat>
Component: ipxe Assignee: Neil Horman <nhorman>
ipxe sub component: ipxe-bootimgs QA Contact: Raviv Bar-Tal <rbartal>
Status: CLOSED INSUFFICIENT_DATA Docs Contact:
Severity: medium    
Priority: unspecified CC: aklimov, astupnik, cbesson, nhorman, rmetrich
Version: 7.7   
Target Milestone: rc   
Target Release: ---   
Hardware: x86_64   
OS: All   
Whiteboard:
Fixed In Version: Doc Type: If docs needed, set a value
Doc Text:
Story Points: ---
Clone Of: Environment:
Last Closed: 2020-06-23 16:38:35 UTC Type: Bug
Regression: --- Mount Type: ---
Documentation: --- CRM:
Verified Versions: Category: ---
oVirt Team: --- RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: --- Target Upstream Version:
Embargoed:
Attachments:
Description Flags
445M initramfs on 1.5G VM - kernel boots successfully none
output log of my reproducer attempt none

Description Grzegorz Halat 2020-02-12 12:10:53 UTC
Description of problem:
Some servers can't be booted by iPXE due to initramfs corruption.

Version-Release number of selected component (if applicable):
20180825-2.git133f4c.el7 and upstream built from commit 18dc73d2

How reproducible:
The issue is always reproducible on some servers; on other servers
booting always works correctly. There is no obvious difference in HW/firmware
between working and non-working configurations. It also happens when the
initramfs is downloaded via HTTP.

Steps to Reproduce:
1. Create a large initramfs,
2. Try to boot a server using iPXE
   - result: kernel panic
3. initramfs corruption can be verified by:
 - using a custom kernel with initramfs checksumming implemented
   (this feature exists neither in RHEL nor upstream;
    it was implemented in a scratch build of the kernel for troubleshooting purposes)
 - using iPXE compiled with DEBUG=initrd:3
   iPXE compiled this way calculates an md5 hash of the initramfs.
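
For comparison with the hash printed by a DEBUG=initrd:3 build, the reference md5 of the initramfs as stored on the boot server can be computed along these lines. This is a minimal sketch, not part of the original report: the function names and the path in the usage note are hypothetical.

```python
import hashlib


def md5_of_chunks(chunks):
    """Return the hex md5 digest of an iterable of bytes objects."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB pieces so a large initramfs is never held in memory at once."""
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns b"" at end of file
        return md5_of_chunks(iter(lambda: f.read(chunk_size), b""))
```

Calling `md5_of_file("/path/to/initramfs.img")` (placeholder path) on the server gives the value that an uncorrupted copy in iPXE memory would be expected to match.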

Actual results:
The initramfs passed to the kernel by iPXE is corrupted. The kernel detects the corruption,
so the initramfs is not uncompressed and the kernel panics due to the lack of an init binary.

The md5 hash of the initramfs is different at each boot.
iPXE compiled with DEBUG=initrd:3 calculates the md5 twice and those checksums are always different,
so it looks like the initramfs is corrupted twice. The second checksum calculated by iPXE matches
the checksum calculated by the kernel.

Expected results:
iPXE should not corrupt the initramfs; the server should boot successfully.


Additional info:
The upstream version of iPXE has been tested, the result is the same.
More details in a private comment because they may contain sensitive data.

Comment 5 Neil Horman 2020-02-12 14:41:30 UTC
can you please post the logs that you have here regardless?  Also, can you capture a tcpdump from the ipxe server (in pcap format) and attach it here?

Comment 7 Neil Horman 2020-02-12 15:08:41 UTC
The fact that the lines are being overwritten on the console is not an issue with the console itself.  You should be able to use minicom or some other utility to dump serial output directly to a log file, which will negate any screen clearing/reset.

The packet loss is an indicator that we should get the tcpdump output I requested above to correlate those errors

Also, what is the uncompressed size of the initramfs?  Looking at your memory map above, you only have 1.9 GB of storage space under the 4GB address boundary, which is needed for the ipxe image, kernel and initramfs, among other potential mappings.  It is entirely possible your initramfs is just too large for that space.

Lastly, you mentioned in comment 6 that ipxe is compiled with various options.  Is this a custom build of ipxe?  If so, we don't support that, and I'd ask that you reproduce this issue with a supported build

Comment 10 Neil Horman 2020-02-13 11:57:24 UTC
The size definitely matters.  The checksum of the file image is based on the uncompressed image, so if tmpfs is running out of space, the checksum will be wrong.  Please confirm the size of the uncompressed image

Comment 11 Neil Horman 2020-02-13 15:39:15 UTC
FWIW, looking at the png you provided, it looks like that particular boot had a ramdisk that was a total of 444MB (I think compressed), meaning you would need a little over a gig of space to decompress it.  That's going to give you a few hundred MB of space to store the kernel (10MB), the ipxe binary (10MB) and the heap and stack space that the drivers need to run properly, so you may legitimately be running out of space in the ipxe environment (which all has to exist under the 4GB mark).  Suggest that you compare a system that consistently works - specifically the E820 map to see how much ram is available on those servers under the 4GB mark.

Comment 12 Grzegorz Halat 2020-02-13 16:46:34 UTC
(In reply to Neil Horman from comment #11)
>Please confirm the size of the uncompressed image
I asked the customer to upload the initramfs which we were using during remote sessions.

> (...) Suggest that you
> compare a system that consistently works - specifically the E820 map to see
> how much ram is available on those servers under the 4GB mark.

We had this idea during the last remote session: boot a working server via iPXE and collect dmesg to compare it with that of a non-working server. Unfortunately we had some issues and we ran out of time. We will try again during the next session.

Comment 13 Neil Horman 2020-02-13 19:53:44 UTC
ok, please let me know

Comment 15 Grzegorz Halat 2020-02-14 16:40:46 UTC
Created attachment 1663151 [details]
445M initramfs on 1.5G VM - kernel boots successfully

Comment 16 Neil Horman 2020-02-15 12:42:07 UTC
I've tried to explain this in comment 7.  pxe boot environments operate entirely in the 32 bit address range, meaning only memory in the e820 map below 4GB is accessible.

On the VM that you tested on, the e820 map looks like this:
0x0000000000000000-0x000000000009fbff usable
0x000000000009fc00-0x000000000009ffff reserved
0x00000000000f0000-0x00000000000fffff reserved
0x0000000000100000-0x00000000bb7dffff usable
0x00000000bb7e0000-0x00000000bb7fffff reserved
0x00000000feffc000-0x00000000feffffff reserved
0x00000000fffc0000-0x00000000ffffffff reserved

The usable sections that use less than 32 significant bits of address space amount to (0xbb7dffff-0x100000)+(0x9fbff) = 0xbb77fbfe = 3145202686 / (1024^3) = 2.9GB of available RAM.

On the failing system the e820 map looks like this:
0x0000000000100000-0x0000000077e07fff usable
0x0000000000000000-0x0000000000098fff usable
0x0000000000099000-0x000000000009ffff reserved
0x00000000000e0000-0x00000000000fffff reserved
0x0000000000100000-0x0000000077e07fff usable
0x0000000077e08000-0x000000007ee45fff reserved
0x000000007ee46000-0x000000007ef58fff ACPI data
0x000000007ef59000-0x000000007f168fff ACPI NVS 
0x000000007f169000-0x000000007f27cfff reserved
0x000000007f27d000-0x000000007f7fffff ACPI NVS
0x0000000080000000-0x000000008fffffff reserved
0x00000000fed1c000-0x00000000fed3ffff reserved
0x00000000ff000000-0x00000000ffffffff reserved
0x0000000100000000-0x0000000f7fffffff usable

It has lots more usable RAM sections, but only the first 2 sections fit under the 32 bit address space limit (there is a 3rd section that does, but it's a duplicate, not sure why it's there).  Regardless, the 32 bit usable address space memory on this system is:
(0x77e07fff-0x100000)+0x98fff = 0x77DA0FFE = 2010779646 / (1024^3) = 1.8GB RAM

If the compressed file size of the initramfs is approximately 0.5GB and the uncompressed size is 1.2GB, you are taking up 1.7GB of your 1.8GB of available RAM before you factor in the kernel size, ipxe image, heap and stack.

You are running out of memory.
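
The hand calculation in this comment can be cross-checked with a short script. This is a sketch under stated assumptions: the function name and tuple layout are illustrative, only a subset of each map's entries is copied from the listings in this comment, duplicate entries are counted once, and each range is treated as inclusive (end - start + 1), so the totals come out 2 bytes higher than the hand-computed figures.

```python
GIB = 1024 ** 3
FOUR_GIB = 4 * GIB


def usable_below_4g(e820):
    """Sum the bytes of 'usable' e820 ranges that lie below the 4 GiB boundary."""
    total = 0
    seen = set()
    for start, end, kind in e820:
        if kind != "usable" or (start, end) in seen:
            continue  # skip non-RAM entries and exact duplicates
        seen.add((start, end))
        clipped_end = min(end, FOUR_GIB - 1)  # ignore anything at or above 4 GiB
        if clipped_end >= start:
            total += clipped_end - start + 1  # ranges are inclusive
    return total


# Entries taken from the maps quoted in this comment (subset)
vm_map = [
    (0x0000000000000000, 0x000000000009FBFF, "usable"),
    (0x000000000009FC00, 0x000000000009FFFF, "reserved"),
    (0x0000000000100000, 0x00000000BB7DFFFF, "usable"),
]
failing_map = [
    (0x0000000000100000, 0x0000000077E07FFF, "usable"),
    (0x0000000000000000, 0x0000000000098FFF, "usable"),
    (0x0000000000100000, 0x0000000077E07FFF, "usable"),  # duplicate entry
    (0x0000000077E08000, 0x000000007EE45FFF, "reserved"),
    (0x0000000100000000, 0x0000000F7FFFFFFF, "usable"),  # above 4 GiB
]

for name, e820 in (("working VM", vm_map), ("failing system", failing_map)):
    size = usable_below_4g(e820)
    print(f"{name}: {size} bytes = {size / GIB:.2f} GiB usable below 4 GiB")
```

With these inputs the script reports 3145202688 bytes (2.93 GiB) for the VM and 2010779648 bytes (1.87 GiB) for the failing system.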

Comment 19 Neil Horman 2020-02-29 14:01:46 UTC
can you provide links to the failing and non-failing initramfs?  This doesn't appear to have anything to do with ipxe or decompression.  According to these logs:

1) ipxe validated the downloaded initramfs' md5sum:
INITRD squashing agent.ramdisk [0x5b991000,0x77618495)->[0x5bfff000,0x77c86495)
INITRD agent.ramdisk at [0x5bfff000,0x77c86495)
md5sum ( 0x5bfff000, 0x1bc87495 ) = 5da87dd14488ec7c2eb0cf40fbf98cad

2) the kernel never gets to the decompression phase, because it can't identify the compression type from the magic numbers at the head of the initramfs.  That would suggest that we have an initramfs that passes its integrity check before booting the kernel (meaning it's unchanged from what's on the server), but that cannot be identified as having a valid decompression algorithm.  If you can post links to the failing and working initramfs images we can look into that further
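
As an illustration of the magic-number check described in point 2, the first bytes of an initramfs image can be inspected offline. This is a rough sketch using widely documented format signatures; it is not the kernel's own detection table, and the function name is hypothetical.

```python
# (signature bytes, format name) pairs for formats commonly used for an initramfs
MAGICS = [
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x5d\x00\x00", "lzma"),
    (b"\x89LZO", "lzo"),
    (b"\x02\x21\x4c\x18", "lz4 (legacy frame)"),
    (b"\x28\xb5\x2f\xfd", "zstd"),
    (b"070701", "cpio newc (uncompressed)"),
    (b"070702", "cpio newc with CRC (uncompressed)"),
]


def detect_compression(head):
    """Return the format name whose signature prefixes `head`, or None if unknown."""
    for magic, name in MAGICS:
        if head.startswith(magic):
            return name
    return None


def detect_file(path):
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as f:
        return detect_compression(f.read(16))
```

A result of `None` for an image that is known to be valid on the server would point at the leading bytes having been altered somewhere between download and kernel handoff.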

Comment 21 Neil Horman 2020-03-03 11:30:32 UTC
well, it looks like I owe you an apology.  The md5sum doesn't in fact match.  I'll spend some time setting up a reproducer to see if I can't get better visibility on what's going on here.

Comment 28 Neil Horman 2020-03-04 17:24:30 UTC
attaching my reproducer attempt, which unfortunately, seems to work

Comment 29 Neil Horman 2020-03-04 17:25:58 UTC
Created attachment 1667556 [details]
output log of my reproducer attempt

Comment 37 Neil Horman 2020-03-12 14:08:07 UTC
ok, then I would return to my thought in comment 33.  It's probably worth a test to limit the low memory available to ipxe, to check whether the e820 map on the compute server is somehow bad and stepping on some device space.

Comment 40 Neil Horman 2020-04-05 14:40:35 UTC
any update here?

Comment 42 Neil Horman 2020-04-08 11:33:30 UTC
first of all, nice work!  That's a really good find

Next, yes, if you could, at your next opportunity use the e820 map to exclude that region, just to confirm that we can make this work, that would be great.

Looking at the data, here are my immediate observations:
1) it's not ascii, so it's likely not from an input device
2) I almost see an ethernet oui in the data (there is a repeating pattern of 52 54 60), the first two bytes of which are a realtek oui, but the 60 doesn't match anything.  might be worth checking to see if any of the nics on board have a mac that starts with 52 54 60
3) The data is somewhat patterned.  Every block starts with either a 00 or 20, and is followed by 32 d6 50 d3.  Makes it seem like it's an informational header of some sort, though I can't figure out exactly what it is.  I thought perhaps it was an smbios system boot information block, but it doesn't match up
4) The fact that hw breakpoints aren't triggered is somewhat telling.  Normally hw breakpoints are implemented by snooping the frontside bus on the cpu for writes to specific addresses, but since it's not triggering, that suggests that the change/corruption is occurring due to a write from a device (i.e. a dma operation). Is it possible to disable the iommu from bios on this system?  It might not do anything but it might be an interesting test to see if doing so changes where the corrupted memory lives.
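
One way to pin down which region to exclude is to compare a dump of the loaded initramfs against the original file and list the byte ranges that differ. The following is only a sketch: it assumes both copies are available as bytes objects, the function name is hypothetical, and the pure-Python loop will be slow on a 445M image.

```python
def differing_ranges(reference, observed):
    """Return half-open (start, end) offset pairs where the two buffers differ.

    Only the common prefix length of the two buffers is compared.
    """
    ranges = []
    start = None
    n = min(len(reference), len(observed))
    for i in range(n):
        if reference[i] != observed[i]:
            if start is None:
                start = i  # a new corrupted run begins here
        elif start is not None:
            ranges.append((start, i))  # run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, n))  # run extends to the end of the buffers
    return ranges
```

Adding the physical load address of the initramfs to each offset gives the memory range that would need to be carved out of the e820 map.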

Comment 46 Neil Horman 2020-06-04 10:42:36 UTC
ping any update here?

Comment 49 Neil Horman 2020-06-11 21:11:47 UTC
ok, so that's good news.  Where does that leave us, however?  We seem to have a system with a bad region in its e820 map.  Is it time to contact the system vendor and ask them what sort of firmware updates are available?

Comment 51 Neil Horman 2020-06-23 16:38:35 UTC
Copy that, thanks for the update.  Wish we could have figured out the root cause here