Description of problem:
I have seen this issue several times when trying to boot a new RHEL4.7.z kernel after upgrading from the RHEL4.7 GA kernel on an Intel i386 PV guest. Some messages from the logs:

...
end_request: I/O error, dev xvda, sector 20951
Buffer I/O error on device xvda1, logical block 10444
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10445
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10446
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10447
lost page write due to I/O error on xvda1
Buffer I/O error on device xvda1, logical block 10448
lost page write due to I/O error on xvda1
...

After that, the guest fails to start.

# uname -a
Linux hp-bl480c-01.rhts.bos.redhat.com 2.6.18-92.1.24.el5xen #1 SMP Thu Jan 8 11:35:39 EST 2009 i686 i686 i386 GNU/Linux

# virsh list --all
 Id Name                 State
----------------------------------
  0 Domain-0             running
  8 rhel4u7_i386_hvm     blocked
  - rhel4u7_i386_pv      shut off

# virsh start rhel4u7_i386_pv
libvir: Xen Daemon error : POST operation failed: (xend.err "Error creating domain: (1, 'Internal error', 'xc_dom_do_gunzip: inflate failed (rc=-3)\\n')")
error: Failed to start domain rhel4u7_i386_pv

Version-Release number of selected component (if applicable):
kernel-2.6.9-78.0.22.EL
kernel-xen-2.6.18-92.1.24.el5

How reproducible:
I have seen it on at least two RHTS machines.

hp-bl480c-01.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56644

dell-pe1955-02.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56282

Steps to Reproduce:
1. Install a file-container-based RHEL4.7 PV guest.
2. Upgrade the guest's kernel to the RHEL4.7.z kernel.
3. Reboot.

Actual results:
The guest failed to start.

Expected results:
The guest started successfully with the new kernel.

Additional info:
The PV guest has the following attributes.
pv install of guest=rhel4u7_i386_pv vcpus=1 memory=1024 container=file installer=nfs
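
The rc=-3 in the xend error above matches zlib's Z_DATA_ERROR, which suggests the compressed kernel image handed to the domain builder was damaged. One way to confirm that would be to run the kernel file through zlib directly. The following is only a minimal sketch: it assumes the kernel file has already been copied out of the guest image and that it is a single plain gzip stream starting at offset 0. The function name check_gzip_stream is made up for this example.

```python
import sys
import zlib

CHUNK = 64 * 1024  # read size; any reasonable value works


def check_gzip_stream(path):
    """Return (ok, detail) for a file expected to be one plain gzip stream.

    Hypothetical helper for triage; not part of Xen or libvirt.
    """
    # 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    inflater = zlib.decompressobj(16 + zlib.MAX_WBITS)
    total = 0
    try:
        with open(path, "rb") as fh:
            while not inflater.eof:
                chunk = fh.read(CHUNK)
                if not chunk:
                    break
                total += len(inflater.decompress(chunk))
    except zlib.error as exc:
        # Corrupt data shows up here, e.g. "Error -3 while decompressing data".
        return False, "inflate failed: %s" % exc
    if not inflater.eof:
        # Input ran out before the gzip trailer was seen.
        return False, "stream ended early after %d bytes of output" % total
    return True, "ok, %d bytes after decompression" % total


if __name__ == "__main__" and len(sys.argv) == 2:
    ok, detail = check_gzip_stream(sys.argv[1])
    print(detail)
    sys.exit(0 if ok else 1)
```

Running it against the kernel from a working guest and from the corrupted one should show whether the file itself is damaged or whether the problem lies elsewhere in the boot path.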
I have done some investigation with Martin Jenner and Mike Gahagan on this issue so far. Both machines use Intel Xeon CPUs.

hp-bl480c-01.rhts.bos.redhat.com
CPUMODEL Intel(R) Xeon(TM) CPU 3.20GHz
CPUFAMILY 15
CPUMODELNUMBER 6
http://lab.rhts.bos.redhat.com/cgi-bin/rhts/system.cgi?id=1129

dell-pe1955-02.rhts.bos.redhat.com
CPUMODEL Intel(R) Xeon(R) CPU 5160 @ 3.00GHz
CPUMODELNUMBER 15
CPUFAMILY 6

The problem is not always reproducible.

hp-bl480c-01.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56643 -- working
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56644 -- corrupted

dell-pe1955-02.rhts.bos.redhat.com
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56563 -- working
https://rhts.redhat.com/cgi-bin/rhts/jobs.cgi?id=56282 -- corrupted

Even the successful run has this message:

end_request: I/O error, dev xvda, sector 7555901
Buffer I/O error on device dm-0, logical block 918334
lost page write due to I/O error on dm-0

https://rhts.redhat.com/testlogs/56563/189640/1586010/guest-rhel4u7_i386_pv.log

Does it sound like we are still getting corruption, but it does not take out the guest every time?
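
To compare the working and corrupted runs, the guest console logs could be scanned for these block-layer messages and grouped by device. This is a rough sketch only: it assumes the logs are plain text in the same format as the excerpts quoted in this report, and the function name summarize_io_errors is invented for illustration.

```python
import re

# Patterns follow the kernel messages quoted in this report.
SECTOR_RE = re.compile(r"end_request: I/O error, dev ([\w-]+), sector (\d+)")
BLOCK_RE = re.compile(r"Buffer I/O error on device ([\w-]+), logical block (\d+)")


def summarize_io_errors(log_text):
    """Group failed sectors and logical blocks by device name."""
    summary = {"sectors": {}, "blocks": {}}
    for dev, sector in SECTOR_RE.findall(log_text):
        summary["sectors"].setdefault(dev, []).append(int(sector))
    for dev, block in BLOCK_RE.findall(log_text):
        summary["blocks"].setdefault(dev, []).append(int(block))
    return summary


if __name__ == "__main__":
    sample = (
        "end_request: I/O error, dev xvda, sector 7555901\n"
        "Buffer I/O error on device dm-0, logical block 918334\n"
        "lost page write due to I/O error on dm-0\n"
    )
    print(summarize_io_errors(sample))
```

If the failing runs consistently report errors on xvda1 while the working runs only report them on dm-0, that would point at which writes are being lost before the guest stops booting.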
I have run an additional 10 of the same tests on one of those affected machines. Apart from 2 jobs that were aborted, apparently because 4 guests could not talk to the RHTS scheduler, I can't trigger the problem any more. I'll change the severity/priority to low/low because it is not reproducible, but feel free to close it if that seems more appropriate.
(In reply to comment #2)
> I have run additional 10 same tests on one of those affected machines, but

Correction -- on both of those affected machines (5 each).
I couldn't reproduce this either. OTOH, I got this:

Badness in local_bh_enable at kernel/softirq.c:141
 [<c01213a8>] local_bh_enable+0x47/0x6f
 [<c0217db9>] skb_checksum+0x133/0x25e
 [<c025160a>] udp_poll+0x66/0x113
 [<c0213ba9>] sock_poll+0x19/0x1d
 [<c016d636>] do_select+0x190/0x2c7
 [<c016d345>] __pollwait+0x0/0x9b
 [<c0144d68>] __kmalloc+0x56/0xd3
 [<c016da6c>] sys_select+0x2e7/0x45c
 [<c010740f>] syscall_call+0x7/0xb

This was with a RHEL 5.2 dom0 and a RHEL 4.7.z guest. It is more or less random, but happens often when running up2date. It is unrelated to this bug, though.
The badness in local_bh_enable is fixed in RHEL 4.8 (commit 45f38c).
Can't reproduce it. Will re-open if I see it again.