Bug 200127

Summary: Guests unable to install successfully
Product: [Fedora] Fedora Reporter: Jeremy Katz <katzj>
Component: xen Assignee: Xen Maintainance List <xen-maint>
Status: CLOSED CURRENTRELEASE QA Contact: Brian Brock <bbrock>
Severity: medium Docs Contact:
Priority: medium    
Version: rawhide CC: bbrock, bstein, clalance, herbert.xu, katzj, markmc, sct
Target Milestone: ---   
Target Release: ---   
Hardware: All   
OS: Linux   
Fixed In Version: kernel-2.6.17-1.2488.fc6 Doc Type: Bug Fix
Last Closed: 2007-03-09 09:18:23 EST Type: ---
Bug Depends On:    
Bug Blocks: 200124    

Description Jeremy Katz 2006-07-25 12:34:31 EDT
If you have an earlier HV/dom0 installed (2.6.16-1.3001_FC5xen0 on x86_64 in
this case... it's been seen with x86 PAE as well) and start a guest install of
current rawhide, the install later aborts due to IO errors.  dmesg from within
the guest:
<4>end_request: I/O error, dev xvda, sector 213205
<4>end_request: I/O error, dev xvda, sector 213473
<4>end_request: I/O error, dev xvda, sector 213631
<4>end_request: I/O error, dev xvda, sector 213675
<4>end_request: I/O error, dev xvda, sector 213701
<3>Aborting journal on device dm-0.
<2>ext3_abort called.
<2>EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
<2>Remounting filesystem read-only
<4>__journal_remove_journal_head: freeing b_committed_data

This is with file backed IO -- I'll try to get a block device attached that I
can use instead to try and see if that works or fails similarly just to help
narrow things down.
Comment 1 Jeremy Katz 2006-07-25 13:26:58 EDT
The same error persists with a block device backed VBD.
Comment 2 Jeremy Katz 2006-07-27 15:49:29 EDT
This persists when doing a guest install with 2462 as both dom0 and domU.
Comment 3 Stephen Tweedie 2006-07-27 17:38:03 EDT
Reproduced.  I've also had a guest that was running an ancient rawhide with 2462
kernels in dom0 and domU, updating all the way to current rawhide --- 800
packages or so to update.  It did so fine without any IO errors.

So disk IO on this kernel is not broken per se; it's just inside anaconda that
it's failing.  (Same domU/dom0 kernel version in each case; lvm-backed domain in
each case, too.)

I have taken a full dmesg log from the fault in my own case; there is nothing to
indicate any problem except for the EIO itself:

<4>end_request: I/O error, dev xvda, sector 211149

Errors like this one are the only sign of any problem, all on the /boot LVM
partition (which also matches what Jeremy reported).  All other errors seen are
expected consequences of that initial error.  Nothing shows up in any dom0 log
files.
Comment 4 Stephen Tweedie 2006-07-27 17:44:08 EDT
Correction: the error shows up in the _root_ filesystem (dm-0), not /boot.
Comment 5 Jeremy Katz 2006-07-29 23:36:48 EDT
*** Bug 200648 has been marked as a duplicate of this bug. ***
Comment 6 Chris Lalancette 2006-07-30 09:52:13 EDT
So, I did some looking at this.  Since I was unable to reproduce at will, Jeremy
gave me access to a box that did it all of the time (running the latest CVS as
of Friday).  To try to debug, I did some instrumentation on the dom0 kernel. 
Basically, I just put a bunch of printk's in the error paths for the blkback
side of things.  When the domU install fails, here's what I saw out of my printks:

Invalid number of sectors: last_sect 8, nsec 2

I put this printk in drivers/xen/blkback/blkback.c, on line 381.  That error
path is taken when the last sector in the request is >= PAGE_SIZE >> 9 (meaning
8), or when the number of sectors is <= 0 (which is not the case, given the
printout).  Because we go into this error path, blkback fails the I/O (meaning
it returns BLKIF_RESP_ERROR to the ring buffer), and the domU then fails the
I/O, leading to the message we see in the install.

It seems to me that somehow the domU is asking for more 512-byte sectors than
will fit in a page, so the dom0 has to fail the request.  What I don't quite
understand yet is why this is only seen in the installer and not during other
heavy I/O.  I'll do more investigation Monday.
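
For illustration, the check described above can be sketched roughly as follows.
This is not the blkback source: the struct, the function names and the fixed
4 KiB page size are assumptions made for the example; only the two conditions
(last sector >= PAGE_SIZE >> 9, or a sector count <= 0) come from the
description in this comment.

```c
#include <stdbool.h>

/* Assumed for the sketch: 4 KiB pages and 512-byte sectors. */
#define SKETCH_PAGE_SIZE    4096
#define SKETCH_SECTOR_SHIFT 9

/* Hypothetical stand-in for one segment of a block request. */
struct sketch_segment {
    int first_sect;  /* first 512-byte sector within the page */
    int last_sect;   /* last 512-byte sector within the page, inclusive */
};

/* Number of sectors the segment claims to cover. */
static int sketch_nsec(const struct sketch_segment *seg)
{
    return seg->last_sect - seg->first_sect + 1;
}

/*
 * Returns true if the segment fits inside a single page.  A false result
 * corresponds to the error path in which the backend fails the request.
 */
static bool sketch_segment_ok(const struct sketch_segment *seg)
{
    /* Only sectors 0..7 exist in a 4 KiB page. */
    if (seg->last_sect >= (SKETCH_PAGE_SIZE >> SKETCH_SECTOR_SHIFT))
        return false;
    /* A segment must cover at least one sector. */
    if (sketch_nsec(seg) <= 0)
        return false;
    return true;
}
```

With the values from the printk (last_sect 8, nsec 2, hence first_sect 7), the
first condition trips: sector index 8 would lie in the next page.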

Chris Lalancette
Comment 7 Mark McLoughlin 2006-07-31 18:44:46 EDT
Okay, a bit of an update:

Installs work fine using ext2 for /; ext3 is what triggers the problem.

It looks like blkfront/blkback is barfing on non-sector-aligned buffers which
jbd is passing down to it. In fs/jbd/transaction.c:do_get_write_access() the
buffers in question are being allocated here:

    frozen_buffer = jbd_kmalloc(jh2bh(jh)->b_size,

These are all jbd metadata buffers (jh->b_jlist == BJ_Metadata).

drivers/xen/vbd.c has:

        /* Each segment in a request is up to an aligned page in size. */
        blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
        blk_queue_max_segment_size(rq, PAGE_SIZE);

So, we think the generic block layer should be fixing up these buffers somewhere.

We can't see anything obvious in any of these areas that has changed recently.

Comment 8 Herbert Xu 2006-07-31 20:51:37 EDT
Turns out that jbd is relying on kmalloc(1024) to return 1024-byte aligned
memory (or at least memory that's 1024 bytes away from a page boundary), an
assumption that does not hold when slab debugging is enabled.
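
A small sketch of why the alignment matters, under stated assumptions: 4 KiB
pages, 512-byte sectors, and a frontend that derives sector indices from the
buffer's offset within its page.  The helper names and the example offsets are
made up for illustration and are not taken from jbd, blkfront or the slab
allocator.

```c
#include <stdbool.h>
#include <stddef.h>

/* Assumed for the sketch: 4 KiB pages and 512-byte sectors. */
#define DEMO_PAGE_SIZE    4096u
#define DEMO_SECTOR_SHIFT 9

/* True if a buffer of len bytes starting at page_offset stays in one page. */
static bool demo_fits_in_page(size_t page_offset, size_t len)
{
    return page_offset + len <= DEMO_PAGE_SIZE;
}

/* Index of the first 512-byte sector touched, counted from the page start. */
static unsigned demo_first_sect(size_t page_offset)
{
    return (unsigned)(page_offset >> DEMO_SECTOR_SHIFT);
}

/* Index of the last sector, assuming len is a whole number of sectors. */
static unsigned demo_last_sect(size_t page_offset, size_t len)
{
    return demo_first_sect(page_offset)
           + (unsigned)(len >> DEMO_SECTOR_SHIFT) - 1;
}
```

A 1024-byte buffer at page offset 3072 (1024-byte aligned) covers sectors 6 to
7 and stays inside the page.  The same buffer at a hypothetical unaligned
offset such as 3600 starts in sector 7 and runs to sector index 8, past the
end of the page, which matches the "last_sect 8, nsec 2" printk in comment 6.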
Comment 9 Mark McLoughlin 2006-08-01 02:59:51 EDT
Should be fixed with kernel-2.6.17-1.2488.fc6.

I've logged #200873 to track the real fix needed so we can switch
CONFIG_DEBUG_SLAB back on for Xen.