Red Hat Bugzilla – Full Text Bug Listing
|Summary:||Guests unable to install successfully|
|Product:||[Fedora] Fedora||Reporter:||Jeremy Katz <katzj>|
|Component:||xen||Assignee:||Xen Maintainance List <xen-maint>|
|Status:||CLOSED CURRENTRELEASE||QA Contact:||Brian Brock <bbrock>|
|Version:||rawhide||CC:||bbrock, bstein, clalance, herbert.xu, katzj, markmc, sct|
|Fixed In Version:||kernel-2.6.17-1.2488.fc6||Doc Type:||Bug Fix|
|Doc Text:||Story Points:||---|
|Last Closed:||2007-03-09 09:18:23 EST||Type:||---|
Description Jeremy Katz 2006-07-25 12:34:31 EDT
If you have an earlier HV/dom0 installed (2.6.16-1.3001_FC5xen0 on x86_64 in this case; it has been seen with x86 PAE as well) and start a guest install of current rawhide, the install later aborts due to I/O errors.

dmesg from within the guest:

<4>end_request: I/O error, dev xvda, sector 213205
<4>end_request: I/O error, dev xvda, sector 213473
<4>end_request: I/O error, dev xvda, sector 213631
<4>end_request: I/O error, dev xvda, sector 213675
<4>end_request: I/O error, dev xvda, sector 213701
<3>Aborting journal on device dm-0.
<2>ext3_abort called.
<2>EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
<2>Remounting filesystem read-only
<4>__journal_remove_journal_head: freeing b_committed_data

This is with file-backed I/O. I'll try to get a block device attached that I can use instead, to see whether that works or fails similarly, just to help narrow things down.
Comment 1 Jeremy Katz 2006-07-25 13:26:58 EDT
The same error persists with a block-device-backed VBD.
Comment 2 Jeremy Katz 2006-07-27 15:49:29 EDT
This persists when doing a guest install with 2462 as both dom0 and domU.
Comment 3 Stephen Tweedie 2006-07-27 17:38:03 EDT
Reproduced.

I've also had a guest that was running an ancient rawhide with 2462 kernels in dom0 and domU, updating all the way to current rawhide: 800 packages or so to update. It did so fine without any I/O errors. So disk I/O on this kernel is not broken per se; it's just inside anaconda that it's failing. (Same domU/dom0 kernel version in each case; lvm-backed domain in each case, too.)

I have taken a full dmesg log from the fault in my own case; there is nothing to indicate any problem except for the EIO itself:

<4>end_request: I/O error, dev xvda, sector 211149

errors are the only sign of any problems, all on the /boot LVM partition (which also matches what Jeremy reported.) All other errors seen are expected consequences of that initial error. Nothing shows up in any dom0 log files.
Comment 4 Stephen Tweedie 2006-07-27 17:44:08 EDT
Correction, error shows up in the _root_ filesystem (dm-0), not boot.
Comment 5 Jeremy Katz 2006-07-29 23:36:48 EDT
*** Bug 200648 has been marked as a duplicate of this bug. ***
Comment 6 Chris Lalancette 2006-07-30 09:52:13 EDT
So, I did some looking at this. Since I was unable to reproduce at will, Jeremy gave me access to a box that did it all of the time (running the latest CVS as of Friday).

To try to debug, I did some instrumentation on the dom0 kernel. Basically, I just put a bunch of printk's in the error paths for the blkback side of things. When the domU install fails, here's what I saw out of my printks:

Invalid number of sectors: last_sect 8, nsec 2

I put this printk in drivers/xen/blkback/blkback.c, on line 381. This basically kicks off this error path when the last sector in the request is >= PAGE_SIZE >> 9 (meaning 8), or when the number of sectors is <= 0 (which is not the case, given the printout). Because we go into this error path, the blkback fails the I/O (meaning it returns BLKIF_RESP_ERROR to the ring buffer), and the domU then fails the I/O, leading to the message we see in the install.

It seems to me that somehow the domU is asking for more 512-byte sectors than will fit in a page, so the dom0 has to fail the request. What I don't quite understand yet is why this is only seen in the installer and not during other heavy I/O. I'll do more investigation Monday.

Chris Lalancette
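The check described above can be sketched in user-space C. This is only an illustration of the condition as stated in this comment, not the blkback source: the struct, the function names, the 4 KiB page size, and the derivation of nsec as last_sect - first_sect + 1 are all assumptions made for the example.

```c
#include <assert.h>
#include <stdbool.h>

#define SECTOR_SHIFT 9
#define PAGE_SIZE_BYTES 4096 /* assumption: 4 KiB pages */
#define SECTORS_PER_PAGE (PAGE_SIZE_BYTES >> SECTOR_SHIFT) /* 8 */

/* Hypothetical stand-in for one segment of a block request. */
struct segment {
    int first_sect; /* first 512-byte sector within the page, 0-based */
    int last_sect;  /* last 512-byte sector within the page, inclusive */
};

/* Assumed derivation: number of sectors the segment covers. */
static int segment_nsec(const struct segment *seg)
{
    return seg->last_sect - seg->first_sect + 1;
}

/* A segment passes only if it ends inside the page and is non-empty. */
static bool segment_is_valid(const struct segment *seg)
{
    return seg->last_sect < SECTORS_PER_PAGE && segment_nsec(seg) > 0;
}

/* The values from the printk: last_sect 8, nsec 2, hence first_sect 7. */
static void check_reported_case(void)
{
    struct segment reported = { 7, 8 };

    assert(segment_nsec(&reported) == 2);
    assert(!segment_is_valid(&reported));
}
```

With 8 sectors per page the valid sector indices are 0 through 7, so a segment ending at sector 8 is rejected even though its sector count is positive, which matches the printout.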
Comment 7 Mark McLoughlin 2006-07-31 18:44:46 EDT
Okay, a bit of an update.

Installs work fine using ext2 for / ... ext3 is what triggers the problem.

It looks like blkfront/blkback is barfing on non-sector-aligned buffers which jbd is passing down to it.

In fs/jbd/transaction.c:do_get_write_access() the buffers in question are being allocated here:

    frozen_buffer = jbd_kmalloc(jh2bh(jh)->b_size, GFP_NOFS);

These are all jbd metadata buffers (jh->b_jlist == BJ_Metadata).

drivers/xen/vbd.c has:

    /* Each segment in a request is up to an aligned page in size. */
    blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
    blk_queue_max_segment_size(rq, PAGE_SIZE);

So, we think the generic block layer should be fixing up these buffers somewhere.

We can't see anything obvious in any of these areas that has changed recently.
Comment 8 Herbert Xu 2006-07-31 20:51:37 EDT
Turns out that jbd is relying on kmalloc(1024) to return 1024-byte aligned memory (or at least memory that's 1024 bytes away from a page boundary) which is false when slab debugging is enabled.
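The alignment arithmetic behind this can be sketched as follows. It is a user-space illustration only: the helper names, the 4 KiB page size, the example addresses, and the way sector indices are derived from a buffer's offset within its page are assumptions, not code from jbd or the Xen block drivers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGE_BYTES 4096u  /* assumption: 4 KiB pages */
#define SECTOR_BYTES 512u

/* Offset of an address within its page. */
static size_t page_offset(size_t addr)
{
    return addr % PAGE_BYTES;
}

/* True if [addr, addr + len) extends past the end of its first page. */
static bool crosses_page(size_t addr, size_t len)
{
    assert(len > 0);
    return page_offset(addr) + len > PAGE_BYTES;
}

/* Assumed mapping from a buffer to in-page sector indices. */
static unsigned first_sect(size_t addr)
{
    return (unsigned)(page_offset(addr) / SECTOR_BYTES);
}

static unsigned last_sect(size_t addr, size_t len)
{
    return first_sect(addr) + (unsigned)(len / SECTOR_BYTES) - 1;
}

/* Compare an aligned 1024-byte buffer with a hypothetically shifted one. */
static void compare_aligned_and_shifted(void)
{
    size_t aligned = 0x10000 + 3072; /* 1024-byte aligned */
    size_t shifted = 0x10000 + 3600; /* hypothetical debug-shifted address */

    assert(!crosses_page(aligned, 1024));
    assert(last_sect(aligned, 1024) == 7);
    assert(crosses_page(shifted, 1024));
    assert(last_sect(shifted, 1024) == 8);
}
```

A 1024-byte buffer that starts on a 1024-byte boundary always ends at or before the page boundary. Once the allocator shifts the start, the same buffer can straddle two pages; under the assumed mapping the shifted example gives first_sect 7 and last_sect 8, the same shape as the request reported in comment 6. The thread does not state the actual offset involved, so 3600 is just one value that would produce those numbers.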
Comment 9 Mark McLoughlin 2006-08-01 02:59:51 EDT
Should be fixed with kernel-2.6.17-1.2488.fc6.

I've logged #200873 to track the real fix needed so we can switch CONFIG_DEBUG_SLAB back on for Xen.