Bug 498394

Summary: Intel i386 PV Guest Container Corrupted
Product: Red Hat Enterprise Linux 4
Reporter: Qian Cai <qcai>
Component: kernel-xen
Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED WONTFIX
QA Contact: Red Hat Kernel QE team <kernel-qe>
Severity: low
Docs Contact:
Priority: low
Version: 4.7.z
CC: clalance, mgahagan, mjenner, pbonzini, xen-maint
Target Milestone: rc
Target Release: ---
Hardware: All
OS: Linux
Whiteboard:
Fixed In Version:
Doc Type: Bug Fix
Doc Text:
Story Points: ---
Clone Of:
Environment:
Last Closed: 2010-05-14 09:06:05 UTC
Type: ---
Regression: ---
Mount Type: ---
Documentation: ---
CRM:
Verified Versions:
Category: ---
oVirt Team: ---
RHEL 7.3 requirements from Atomic Host:
Cloudforms Team: ---
Target Upstream Version:
Embargoed:
Bug Depends On:
Bug Blocks: 458302

Description Qian Cai 2009-04-30 09:54:18 UTC
Description of problem:
I have seen this issue several times when booting a new RHEL4.7.z kernel, after upgrading from the RHEL4.7 GA kernel, on an Intel i386 PV guest. Some messages from the logs:

...
end_request: I/O error, dev xvda, sector 20951
Buffer I/O error on device xvda1, logical block 10444
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10445
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10446
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10447
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10448
lost page write due to I/O error on xvda1
...

After that, the guest fails to start.
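
As a rough cross-check of the numbers above, the failing sector on xvda can be related to the logical blocks reported on xvda1. The sketch below is only an illustration: the partition start (sector 63) and the 1 KiB block size are assumptions that are not stated anywhere in this report, and the function name is hypothetical. Under those assumptions, sector 20951 maps to the first block reported in the log.

```python
SECTOR_SIZE = 512  # bytes; the unit used by "end_request: I/O error ... sector N"


def sector_to_logical_block(disk_sector, part_start_sector, block_size):
    """Map an absolute disk sector to a logical block inside a partition."""
    sectors_per_block = block_size // SECTOR_SIZE
    return (disk_sector - part_start_sector) // sectors_per_block


# Assumed layout (not taken from this report): xvda1 starts at sector 63
# and is addressed in 1 KiB blocks.
print(sector_to_logical_block(20951, part_start_sector=63, block_size=1024))  # 10444
```

If the real partition offset or block size differs, the result will not line up with the logged block numbers, so the match here is suggestive rather than conclusive.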

# uname -a
Linux hp-bl480c-01.rhts.bos.redhat.com 2.6.18-92.1.24.el5xen #1 SMP Thu Jan 8
11:35:39 EST 2009 i686 i686 i386 GNU/Linux

# virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  8 rhel4u7_i386_hvm     blocked
  - rhel4u7_i386_pv      shut off

# virsh start rhel4u7_i386_pv
libvir: Xen Daemon error : POST operation failed: (xend.err "Error creating
domain: (1, 'Internal error', 'xc_dom_do_gunzip: inflate failed (rc=-3)\\n')")
error: Failed to start domain rhel4u7_i386_pv
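
The rc=-3 in the error above matches zlib's Z_DATA_ERROR (corrupted compressed data), assuming the value is passed through unchanged from zlib's inflate(). That would be consistent with a damaged kernel image on the guest disk. A minimal sketch of an equivalent check, assuming the guest kernel file has already been copied out of the container and is a single plain gzip stream (both are assumptions; the function name is hypothetical):

```python
import zlib

Z_DATA_ERROR = -3  # zlib's return code for corrupted compressed input


def gzip_stream_intact(data):
    """Return True if `data` holds one complete, valid gzip stream."""
    # Adding 16 to the window bits tells zlib to expect gzip framing,
    # so the CRC32 in the gzip trailer is verified as well.
    inflater = zlib.decompressobj(wbits=zlib.MAX_WBITS | 16)
    try:
        inflater.decompress(data)
    except zlib.error:
        # Raised for invalid deflate data or a CRC mismatch.
        return False
    # A truncated stream raises nothing; it simply never reaches the end.
    return inflater.eof
```

For example, calling `gzip_stream_intact(open("/tmp/guest-vmlinuz", "rb").read())` should return False for a truncated or corrupted image; the path is a placeholder, not a location used in this report.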

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.22.EL
kernel-xen-2.6.18-92.1.24.el5

How reproducible:
I have seen it on at least two RHTS machines.

hp-bl480c-01.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56644

dell-pe1955-02.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56282

Steps to Reproduce:
1. Install a file-container-based RHEL4.7 PV guest.
2. Upgrade the guest's kernel to the RHEL4.7.z kernel.
3. Reboot the guest.
  
Actual results:
The guest failed to start.

Expected results:
The guest started successfully with the new kernel.

Additional info:
The PV guest has the following attributes.

pv install of guest=rhel4u7_i386_pv vcpus=1 memory=1024 container=file installer=nfs

Comment 1 Qian Cai 2009-04-30 15:21:06 UTC
I have done some investigation with Martin Jenner and Mike Gahagan on this issue so far.

Both machines use Intel Xeon CPUs.

hp-bl480c-01.rhts.bos.redhat.com  
CPUMODEL       Intel(R) Xeon(TM) CPU 3.20GHz
CPUFAMILY      15
CPUMODELNUMBER 6
http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=1129

dell-pe1955-02.rhts.bos.redhat.com
CPUMODEL       Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
CPUMODELNUMBER 15
CPUFAMILY      6

The problem is not always reproducible.

hp-bl480c-01.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56643 -- working
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56644 -- corrupted

dell-pe1955-02.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56563 -- working
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56282 -- corrupted

Even the successful run logged this message:
 end_request: I/O error, dev xvda, sector 7555901
 Buffer I/O error on device dm-0, logical block 918334
 lost page write due to I/O error on dm-0
https://rhts.redhat.com/testlogs/56563/189640/1586010/guest-rhel4u7_i386_pv.log

Does it sound like we are still getting corruption, but it does not take out the guest every time?
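
Since a passing job can still contain these messages, one way to check other runs is to scan each guest console log for buffer I/O errors. A minimal sketch, assuming the log has been downloaded as text; the function name is hypothetical and the sample is the three-line excerpt above:

```python
import re
from collections import Counter

# Matches lines such as:
#   Buffer I/O error on device dm-0, logical block 918334
IO_ERROR = re.compile(r"Buffer I/O error on device ([^,\s]+), logical block (\d+)")


def count_buffer_io_errors(log_text):
    """Count 'Buffer I/O error' lines per device in a console log."""
    return Counter(match.group(1) for match in IO_ERROR.finditer(log_text))


sample = """\
 end_request: I/O error, dev xvda, sector 7555901
 Buffer I/O error on device dm-0, logical block 918334
 lost page write due to I/O error on dm-0
"""
print(count_buffer_io_errors(sample))
```

A non-empty result on a job marked as working would suggest that writes are being lost even when the guest still boots.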

Comment 2 Qian Cai 2009-05-10 16:28:22 UTC
I have run 10 additional runs of the same test on one of those affected machines, but apart from 2 jobs that were aborted, seemingly because 4 guests could not talk to the RHTS scheduler, I can't trigger the problem any more. I'll change the severity/priority to low/low because it is not reproducible, but you can close it if you feel that is more appropriate.

Comment 3 Qian Cai 2009-05-10 16:30:31 UTC
(In reply to comment #2)
> I have run 10 additional runs of the same test on one of those affected machines, but

Correction -- on both of those affected machines (5 each).

Comment 4 Paolo Bonzini 2009-06-16 12:41:12 UTC
I couldn't reproduce this either. OTOH, I got this:

Badness in local_bh_enable at kernel/softirq.c:141
 [<c01213a8>] local_bh_enable+0x47/0x6f
 [<c0217db9>] skb_checksum+0x133/0x25e
 [<c025160a>] udp_poll+0x66/0x113
 [<c0213ba9>] sock_poll+0x19/0x1d
 [<c016d636>] do_select+0x190/0x2c7
 [<c016d345>] __pollwait+0x0/0x9b
 [<c0144d68>] __kmalloc+0x56/0xd3
 [<c016da6c>] sys_select+0x2e7/0x45c
 [<c010740f>] syscall_call+0x7/0xb

This was with a RHEL 5.2 dom0 and a RHEL 4.7.z guest (more or less random, but it happens often when running up2date). It is unrelated to this bug, though.

Comment 5 Paolo Bonzini 2009-06-17 10:28:35 UTC
The badness in local_bh_enable is fixed in RHEL 4.8 (commit 45f38c).

Comment 6 Qian Cai 2010-05-14 09:06:05 UTC
Can't reproduce it. Will re-open it if I see it again.