Bug 224214 - corrupted RPMs when installing full-virt guest
Summary: corrupted RPMs when installing full-virt guest
Keywords:
Status: CLOSED NOTABUG
Alias: None
Product: Red Hat Enterprise Linux 5
Classification: Red Hat
Component: kernel-xen
Version: 5.0
Hardware: i686
OS: Linux
Priority: medium
Severity: medium
Target Milestone: ---
Target Release: ---
Assignee: Steven Rostedt
QA Contact:
URL:
Whiteboard:
Depends On: 218926
Blocks:
Reported: 2007-01-24 17:30 UTC by Stephen Tweedie
Modified: 2009-02-16 19:41 UTC (History)

Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Clone Of:
Environment:
Last Closed: 2009-02-16 19:41:42 UTC
Target Upstream Version:
Embargoed:


Attachments:
bnx2: update firmware to correct rx problem in promisc mode (78.32 KB, patch)
2007-02-22 13:46 UTC, Stephen Tweedie

Description Stephen Tweedie 2007-01-24 17:30:43 UTC
+++ This bug was initially created as a clone of Bug #218926 +++

Description of problem:
Installing a full-virt xen guest is failing, when exactly the same install
succeeds on non-virtual hardware.

Version-Release number of selected component (if applicable):
kernel-xen-2.6.18-1.2849.fc6
xen-3.0.3-0.1.rc3
virt-manager-0.2.6-2.fc6

How reproducible:
Always

Steps to Reproduce:
1. Create a full-virt guest, via virt-manager 
2. Boot RHEL3 U6 32 bit install media
3. Kickstart install of guest, with no manual intervention.
  
Actual results:
Install halts midway through installing RPMs. Reports that there was an error
installing a particular package.

/mnt/sysimage/root/install.log contains:
<snip>
Installing comps-3AS-0.20050921.i386.
error: unpacking of archive failed on file
/usr/share/comps/i386/hdlist2;4579147a: cpio: MD5 sum mismatch

N.B. This occurrence, with the comps RPM, is the most recent. Earlier, the same
issue occurred with the glibc RPM.

/mnt/sysimage/var/tmp/ contains a file called "comps.rpm", whose md5sum, and
size, are not the same as the rpm in the install media.

Using the same boot media, and the same kickstart file, repeatedly installs
without errors on non-virtual hardware.

Expected results:

Successful installation (as on "real" hardware)

Additional info:

Anaconda is being run with the following arguments: 
skipddc nofb nousb ksdevice=eth0 ip=172.16.100.50 netmask=255.255.255.128
gateway=172.16.100.1 dns=139.149.131.156
ks=http://172.16.100.41/172.16.100.50-ks.cfg

Anaconda is being directed to retrieve the RPMs via http.

There's plenty of unused space on each volume under /mnt/sysimage

-- Additional comment from sct on 2006-12-08 10:38 EST --
I've got only a few ideas.  We've had problems with interactions between Xen
networking and IP checksum code in the past --- could you perhaps run with an
httpd served from the local host to eliminate that possibility?  What NIC are
you using, too?


-- Additional comment from sct on 2006-12-08 10:40 EST --
Adding Herbert to CC, as this appears possibly to involve checksum problems over
the network (to be verified.)

-- Additional comment from athomas on 2006-12-08 12:27 EST --
Retried, having moved the http install tree to dom0. The guest install still
aborts, with anaconda reporting md5sum errors in unpacking comps-3AS-0.20050921.i386

I've verified that the comps-3AS-0.20050921.i386 RPM under the DocumentRoot on
dom0 is valid.

Please disregard the reference above to comps.rpm under /mnt/sysimage/var/tmp.
That is an uncorrupted copy of the comps.xml from under base/ on the install
media and is unrelated to the comps-3AS-0.20050921.i386 RPM that is failing to
install.

The machine has two NICs. lspci describes them both as:
06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit
Ethernet Controller (Copper) (rev 01)

-- Additional comment from herbert.xu on 2006-12-11 00:35 EST --
Hmm, I'm not sure how a network checksum error can cause file size to change. 
The most likely result of a checksum error is the packet being dropped and the
connection hung.

What exactly is the size difference? Is it just truncation? If so is the
non-truncated part identical to the original?
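
One way to answer the truncation question is to compare the two copies byte by byte. The following is only a sketch: the function names are hypothetical, and it assumes both the original RPM and the copy received by the guest can be read into memory.

```python
from typing import Optional


def first_difference(original: bytes, received: bytes) -> Optional[int]:
    """Return the offset of the first differing byte in the common prefix.

    Returns None when one input is a prefix of the other (or they are equal).
    """
    for offset, (a, b) in enumerate(zip(original, received)):
        if a != b:
            return offset
    return None


def classify(original: bytes, received: bytes) -> str:
    """Describe how the received copy differs from the original."""
    diff = first_difference(original, received)
    if diff is not None:
        return f"corrupted at offset {diff}"
    if len(received) < len(original):
        return "truncated"
    if len(received) > len(original):
        return "extended"
    return "identical"
```

A result of "truncated" would mean the non-truncated part matches the original; a corruption offset would point at damaged data rather than a short transfer.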

-- Additional comment from athomas on 2006-12-11 05:39 EST --
In fact, there isn't a size difference. I was mistakenly assuming that the
comps.rpm on local disk was a copy of the comps package that is failing to
install. The two different sizes are actually different packages.


Here's the real problem:

The relevant error message is that /mnt/sysimage/root/install.log contains:
<snip>
Installing comps-3AS-0.20050921.i386.
error: unpacking of archive failed on file
/usr/share/comps/i386/hdlist2;4579147a: cpio: MD5 sum mismatch

I've used "rpm -K" to verify that the version of the comps-3AS-0.20050921.i386
rpm that is being served by httpd running in dom0 isn't corrupt.
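
As an independent cross-check alongside "rpm -K", the copy under the DocumentRoot can be compared with the copy on the install media by digest. This is a minimal sketch, not what was actually run here: the helper names are hypothetical and the paths are whatever the two copies happen to be.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """True when both files have the same MD5 digest."""
    return md5_of_file(path_a) == md5_of_file(path_b)
```

Matching digests on the server side would point the finger at what happens inside the guest rather than at the served file.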



-- Additional comment from jmh on 2006-12-11 10:56 EST --

Tried the same on a RHEL5 Beta 2 (2747) system (woodie) and ran into the same
problems with checksum errors if HTTP is used to install RHEL-3 (tried U6 and U7).
Using NFS or FTP works fine. Installing RHEL-4 via HTTP also works fine, so the
problem seems to be isolated to RHEL-3 installs via HTTP.
If someone wants to log in to the RHEL5 Beta 2 systems, please contact me and I'll
provide the details offline.



-- Additional comment from sct on 2006-12-11 11:26 EST --
OK, many thanks.  Is woodie running the same (Intel 1G) NIC?

-- Additional comment from sct on 2006-12-11 11:29 EST --
Created an attachment (id=143296)
[XEN] Extend emulator to fully decode ModRM and SIB bytes.

Please also try with this patch.  It's a hypervisor patch, so much faster to
test than anything against the kernel.  This is from upstream changeset 12528,
backported to current CVS HEAD RHEL5 kernel-xen.

-- Additional comment from jmh on 2006-12-12 05:29 EST --

Yes, our local system is using the same 1Gb Ethernet card:
[root@woodie ~]# lspci|grep Ethe
06:00.0 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet
Controller (Copper) (rev 01)
06:00.1 Ethernet controller: Intel Corporation 80003ES2LAN Gigabit Ethernet
Controller (Copper) (rev 01)

I have also been doing some more testing, with no major breakthrough.
I tried various newer kernels but still keep getting corrupted RPMs.

I installed/booted the following kernels:

                                    Installation via 
Kernel Revision                 RHEL3/HTTP   RHEL3/FTP    RHEL4/HTTP
RHEL5 Beta 2 (rev 2747)           FAIL        SUCCESS      SUCCESS
RHEL5 RC3    (rev 2839)           FAIL        SUCCESS      SUCCESS
RHEL5 RC2    (rev 2817)           FAIL        SUCCESS      SUCCESS

Initial results raised hopes that RC3/2839 might indeed help, but
after 2 successful installations we still ended up with corrupted RPMs.

I have not yet tested with the above patch, as the nightly build had not
completed on the 11th.

Will try earlier kernels/HVs to see when/if it ever worked.



-- Additional comment from athomas on 2006-12-12 08:33 EST --
Using a hypervisor which has the patch from Comment #8, I've been unable to
reproduce the error after several attempts.



-- Additional comment from jmh on 2006-12-12 09:58 EST --

Update on testing with RHEL5 B2 + HV patch from #8.
I have now installed 15 RHEL3 guests using HTTP without any problems. In the
past we've never been able to get through more than 3 guest installs without
hitting the corruption problem.
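
To put the 15 clean installs in perspective, here is a rough back-of-the-envelope sketch. It assumes each install fails independently with a constant probability, which has not been established for this bug, and the failure rates below are illustrative guesses, not measurements.

```python
def chance_of_clean_run(failure_rate: float, installs: int) -> float:
    """Probability that `installs` independent installs all succeed.

    Assumes a constant, independent per-install failure rate (hypothetical).
    """
    return (1.0 - failure_rate) ** installs


# Illustrative per-install failure rates; none of these were measured.
for rate in (0.2, 1 / 3, 0.5):
    print(f"failure rate {rate:.2f}: "
          f"P(15 clean installs) = {chance_of_clean_run(rate, 15):.5f}")
```

Under those assumptions, 15 consecutive successes would be unlikely if the bug were still present, although this is not proof that it is gone.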



-- Additional comment from sct on 2006-12-12 18:07 EST --
Patch seems to work, reassigning for FC6 merge.

-- Additional comment from athomas on 2007-01-19 08:18 EST --
The same behavior has returned.

I'm running 32 bit kernel-xen-2.6.18-4.el5 on a 2.0GHz Intel Xeon 5130 Woodcrest
Dual-Core, installing RHEL3 via HTTP into a full-virt guest.

Same results as previously: the install terminates, complaining about an
error installing glibc. /mnt/sysimage/root/install.log reports an MD5 sum
mismatch on a specific library from within the RPM.

As previously, the RPM is not corrupted.

-- Additional comment from athomas on 2007-01-19 10:13 EST --
I should have mentioned that the failure from Comment #13 is reproducible.

By contrast, I've just used the same boot media & kickstart, against the same
install tree, to successfully create the RHEL3 guest on a woodcrest running the
64 bit kernel-xen-2.6.18-1.2910.el5.

Unless the 64/32 bitness makes a comparison irrelevant, the issue may be a
regression since 2.6.18-1.2910.el5.

I'll retest with an older 32 bit kernel

-- Additional comment from athomas on 2007-01-22 11:11 EST --
Moving back to an older kernel-xen RPM to check when a regression occurred hasn't
been successful. I've switched back to the 32 bit kernel-xen-2.6.18-1.3014.el5
RPM, but that fails to install in a totally different way: hanging during kernel
boot.

With the latest 32 bit RHEL5 kernel-xen RPM (2.6.18-4.el5), I'm still able to
reproduce the memory corruption during install.

Comment 1 RHEL Program Management 2007-01-24 18:01:48 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red
Hat Enterprise Linux major release.  Product Management has requested further
review of this request by Red Hat Engineering, for potential inclusion in a Red
Hat Enterprise Linux Major release.  This request is not yet committed for
inclusion.

Comment 2 Stephen Tweedie 2007-01-24 18:17:59 UTC
"I've switched back to the 32 bit kernel-xen-2.6.18-1.3014.el5
RPM, but that fails to install in a totally different way: Hanging during kernel
boot."  

What fails to boot, the dom0 or the domU?  HVM was known broken in that release,
btw.  kernel-2_6_18-1_2962_el5 was the previously-known working version;
2.6.18-2.el5 fixed HVM again.

Comment 3 Stephen Tweedie 2007-01-24 18:38:40 UTC
Also, exactly which RHEL-3 are you using?  Just tried RHEL3-U8 w/ NFS and that
works fine, will try HTTP next.


Comment 4 Stephen Tweedie 2007-01-24 18:43:27 UTC
"By contrast, I've just used the same boot media & kickstart, against the same
install tree, to successfully create the RHEL3 guest on a woodcrest running the
64 bit kernel-xen-2.6.18-1.2910.el5.

Unless the 64/32 bitness makes a comparison irrelevant, the issue may be a
regression since 2.6.18-1.2910.el5."

32-vs-64-bit host certainly makes a huge difference for HVM, yes.

Comment 5 Stephen Tweedie 2007-01-24 18:49:06 UTC
Also, so that we can be sure we're testing like for like, what exact guest
configuration are you using?  file or block backed, vcpus, memory etc?  Thanks.


Comment 6 Stephen Tweedie 2007-01-24 20:40:29 UTC
RHEL-3 U8 i386 install on kernel-xen-2.6.18-4.el5xen.i686 host over HTTP just
finished fine for me.  600mb guest; I'm retrying with a 300mb guest now, but
that is already 32% into the process, with no problems so far.


Comment 7 Stephen Tweedie 2007-01-24 21:17:59 UTC
Just had 3 consecutive guest installs complete with no problem.  We'll really
need a bit more specific detail about the configuration that's failing, I think.

Comment 13 Angus Thomas 2007-01-29 16:16:06 UTC
The most recent results (installing rhel3u8), were repeated after rebooting dom0
with "mem=2g" as an argument to the hypervisor. The machines have 8 gig of RAM
installed.


Comment 14 Stephen Tweedie 2007-01-29 23:43:00 UTC
Can you please try to reproduce with the tree held locally on the dom0 disk, and
exported via http from there?  I'd like to be able to eliminate the NIC as a
possible factor here.


Comment 15 Angus Thomas 2007-01-30 12:56:33 UTC
I have been able to reproduce the error, with the install tree exported from
dom0 via http.

Physically replacing the Broadcom NICs with some other hardware would be tricky,
since this is happening on blades, which have their NICs integrated on the board.


Comment 16 Stephen Tweedie 2007-01-30 23:44:40 UTC
No matter, there's no point in replacing the NIC --- if you can reproduce from a
dom0 httpd, then that pretty well eliminates the NIC/driver/swiotlb from the
equation.

Are you able to test with a current xen-unstable snapshot?

Comment 18 Stephen Tweedie 2007-02-22 13:46:53 UTC
Created attachment 148564 [details]
bnx2: update firmware to correct rx problem in promisc mode

Note: to reproduce on xen-unstable using bnx2 net drivers, you'll need this
patch from the RHEL-5 tree in order to get working networking.

Comment 20 Bob Gautier 2007-03-06 16:39:32 UTC
Tried again using the backported bnx2 driver on the 2.6.16.29 kernel, under
3.0.3-branched, and I still see the rpm corruption problem.

It seems to me that the anaconda environment is not good for debugging, and
since we can successfully build RHEL3 on *real* hardware, would it be helpful to
'clone' a RHEL3 to make a full-virt RHEL3 guest that we can run tests on?


Comment 21 Stephen Tweedie 2007-03-06 18:08:22 UTC
Trouble is, the installer is the only known reproducer for this problem, and
even it can't reproduce if we use NFS or FTP install, only HTTP.  So introducing
more variables is likely to be painful at this stage: there's just no guarantee
that we'll be able to find an alternative reproducer to run inside the guest.


Comment 23 Stephen Tweedie 2007-03-13 16:08:56 UTC
What was the result of the 3.0.3-branched testing, and are there any
3.0.4-branched results yet?  Thanks!


Comment 24 Bob Gautier 2007-03-14 13:06:52 UTC
3.0.4-branched seems to work fine, so we are continuing to home in on the point
where the fix was made.
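
The search for the fixing changeset amounts to a bisection between a known-bad and a known-good revision. Because the failure is intermittent, a revision should only count as good after several clean installs. The sketch below assumes this; the function names, the `install_ok` callback and the default attempt count are all hypothetical, not the procedure actually used.

```python
from typing import Callable, Sequence


def is_good(revision: str, install_ok: Callable[[str], bool],
            attempts: int = 5) -> bool:
    """Treat a revision as good only if every install attempt succeeds."""
    return all(install_ok(revision) for _ in range(attempts))


def first_good_revision(revisions: Sequence[str],
                        install_ok: Callable[[str], bool],
                        attempts: int = 5) -> str:
    """Bisect for the first revision where installs stop failing.

    `revisions` is ordered oldest to newest; the oldest entry is assumed
    known-bad and the newest known-good.
    """
    low, high = 0, len(revisions) - 1  # low is bad, high is good
    while high - low > 1:
        mid = (low + high) // 2
        if is_good(revisions[mid], install_ok, attempts):
            high = mid
        else:
            low = mid
    return revisions[high]
```

With a failure that does not show up on every install, too few attempts per revision can mislabel a broken revision as good and send the search the wrong way.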

Comment 25 Bob Gautier 2007-03-14 15:28:45 UTC
Revision 12363:25e6a17e82f0 is broken.  3 failures out of 3.

Also, in case it might be significant, I note that on switching VC to check the
reported cause of the install failure (ALT-F2), the screen I get (normally blank
with a # shell prompt at the top) is not blank, but is instead the initial
ISOLINUX screen, with the boot prompt and the first few lines of the installer
kernel output showing (Loading vmlinux...).  Also the top-left character is
blank: no initial '#' prompt.  The shell works fine, however.


Comment 27 RHEL Program Management 2007-09-07 19:56:26 UTC
This request was previously evaluated by Red Hat Product Management
for inclusion in the current Red Hat Enterprise Linux release, but
Red Hat was unable to resolve it in time.  This request will be
reviewed for a future Red Hat Enterprise Linux release.

Comment 28 Chris Lalancette 2007-09-28 21:14:51 UTC
This one is kind of stale now; since it seems from the comments that 3.0.4 was
working, and 5.1 is now based on 3.1.0, can we re-test this environment and see
if this issue is now fixed?

Thanks,
Chris Lalancette

Comment 29 Bill Burns 2009-02-16 19:41:42 UTC
Closing this. The issue has not been seen for a couple of releases, and there was no response to the needinfo request. Please reopen if you disagree.

