Red Hat Bugzilla – Bug 200127
Guests unable to install successfully
Last modified: 2007-11-30 17:11:38 EST
If you have an earlier HV/dom0 installed (2.6.16-1.3001_FC5xen0 on x86_64 in
this case... it's been seen with x86 PAE as well) and start a guest install of
current rawhide, the install later aborts due to IO errors. dmesg from within
the guest shows:
<4>end_request: I/O error, dev xvda, sector 213205
<4>end_request: I/O error, dev xvda, sector 213473
<4>end_request: I/O error, dev xvda, sector 213631
<4>end_request: I/O error, dev xvda, sector 213675
<4>end_request: I/O error, dev xvda, sector 213701
<3>Aborting journal on device dm-0.
<2>EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
<2>Remounting filesystem read-only
<4>__journal_remove_journal_head: freeing b_committed_data
This is with file-backed IO -- I'll try attaching a block device instead to see
whether that works or fails the same way, just to help narrow things down.
The same error persists with a block-device-backed VBD.
This also persists when doing a guest install with 2462 as both the dom0 and
domU kernel.
Reproduced. I've also had a guest running an ancient rawhide, with 2462 kernels
in dom0 and domU, update all the way to current rawhide --- 800 packages or so.
It did so fine, without any IO errors.
So disk IO on this kernel is not broken per se; it's just inside anaconda that
it's failing. (Same domU/dom0 kernel version in each case; lvm-backed domain in
each case, too.)
I have taken a full dmesg log from the fault in my own case; there is nothing to
indicate any problem except for the EIO itself:
<4>end_request: I/O error, dev xvda, sector 211149
These end_request I/O errors are the only sign of any problems, all on the
/boot LVM partition (which also matches what Jeremy reported). All other errors
seen are expected consequences of that initial error. Nothing shows up in any
dom0 log files.
Correction, error shows up in the _root_ filesystem (dm-0), not boot.
*** Bug 200648 has been marked as a duplicate of this bug. ***
So, I did some looking at this. Since I was unable to reproduce at will, Jeremy
gave me access to a box that did it all of the time (running the latest CVS as
of Friday). To try to debug, I did some instrumentation on the dom0 kernel.
Basically, I just put a bunch of printk's in the error paths for the blkback
side of things. When the domU install fails, here's what I saw out of my printks:
Invalid number of sectors: last_sect 8, nsec 2
I put this printk in drivers/xen/blkback/blkback.c, on line 381. This error
path is taken when the last sector in the request is >= PAGE_SIZE >> 9
(meaning 8), or when the number of sectors is <= 0 (which is not the case,
given the printout). Because we go down this error path, blkback fails the I/O
(meaning it returns BLKIF_RESP_ERROR on the ring buffer), and the domU then
fails the I/O in turn, leading to the errors we see in the install.
It seems to me that somehow the domU is asking for more 512-byte sectors than
will fit in a page, so the dom0 has to fail the request. What I don't quite
understand yet is why this is only seen in the installer and not during other
heavy I/O. I'll do more investigation Monday.
Okay, a bit of an update.
Installs work fine using ext2 for /; ext3 is what triggers the problem.
It looks like blkfront/blkback is barfing on non-sector-aligned buffers which
jbd is passing down to it. In fs/jbd/transaction.c:do_get_write_access() the
buffers in question are being allocated here:
frozen_buffer = jbd_kmalloc(jh2bh(jh)->b_size,
These are all jbd metadata buffers (jh->b_jlist == BJ_Metadata)
/* Each segment in a request is up to an aligned page in size. */
blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
So, we think the generic block layer should be fixing up these buffers
somewhere, but we can't see anything obvious in any of these areas that has
changed recently.
Turns out that jbd is relying on kmalloc(1024) to return 1024-byte-aligned
memory (or at least memory that's a full 1024 bytes away from a page boundary),
which is false when slab debugging is enabled.
Should be fixed with kernel-2.6.17-1.2488.fc6.
I've logged #200873 to track the real fix needed so we can switch
CONFIG_DEBUG_SLAB back on for Xen.