Bug 200127
Summary: | Guests unable to install successfully | ||
---|---|---|---|
Product: | [Fedora] Fedora | Reporter: | Jeremy Katz <katzj> |
Component: | xen | Assignee: | Xen Maintainance List <xen-maint> |
Status: | CLOSED CURRENTRELEASE | QA Contact: | Brian Brock <bbrock> |
Severity: | medium | Docs Contact: | |
Priority: | medium | ||
Version: | rawhide | CC: | bbrock, bstein, clalance, herbert.xu, katzj, markmc, sct |
Target Milestone: | --- | ||
Target Release: | --- | ||
Hardware: | All | ||
OS: | Linux | ||
Whiteboard: | |||
Fixed In Version: | kernel-2.6.17-1.2488.fc6 | Doc Type: | Bug Fix |
Last Closed: | 2007-03-09 14:18:23 UTC | Type: | --- |
Bug Blocks: | 200124 |
Description
Jeremy Katz
2006-07-25 16:34:31 UTC
The same error persists with a block device backed VBD.

This persists when doing a guest install with 2462 as dom0 and domU.

Reproduced. I've also had a guest that was running an ancient rawhide with 2462 kernels in dom0 and domU, updating all the way to current rawhide --- 800 packages or so to update. It did so fine without any IO errors. So disk IO on this kernel is not broken per se; it's just inside anaconda that it's failing. (Same domU/dom0 kernel version in each case; lvm-backed domain in each case, too.)

I have taken a full dmesg log from the fault in my own case; there is nothing to indicate any problem except for the EIO itself:

<4>end_request: I/O error, dev xvda, sector 211149

errors are the only sign of any problems, all on the /boot LVM partition (which also matches what Jeremy reported.) All other errors seen are expected consequences of that initial error. Nothing shows up in any dom0 log files.

Correction, error shows up in the _root_ filesystem (dm-0), not boot.

*** Bug 200648 has been marked as a duplicate of this bug. ***

So, I did some looking at this. Since I was unable to reproduce at will, Jeremy gave me access to a box that did it all of the time (running the latest CVS as of Friday). To try to debug, I did some instrumentation on the dom0 kernel. Basically, I just put a bunch of printk's in the error paths for the blkback side of things. When the domU install fails, here's what I saw out of my printks:

Invalid number of sectors: last_sect 8, nsec 2

I put this printk in drivers/xen/blkback/blkback.c, on line 381. This basically kicks off this error path when the last sector in the request is >= PAGE_SIZE >> 9 (meaning 8), or when the number of sectors <= 0 (which is not the case, given the printout). Because we go into this error path, the blkback fails the I/O (meaning it returns BLKIF_RESP_ERROR to the ring buffer), and the domU then fails the I/O, leading to the message we see in the install.
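
For reference, the blkback check described above can be sketched in C roughly as follows. This is only an illustration, not the code at blkback.c line 381: the names (`sketch_segment`, `sketch_nsec`, `sketch_segment_ok`) are hypothetical, and it assumes 4 KiB pages and 512-byte sectors so that PAGE_SIZE >> 9 is 8. The first_sect value of 7 is derived from the printk output (last_sect 8, nsec 2), not logged directly.

```c
#include <assert.h>
#include <stdbool.h>

/* Assumptions for illustration: 4 KiB pages, 512-byte sectors. */
#define SKETCH_PAGE_SIZE        4096
#define SKETCH_SECTOR_SHIFT     9
#define SKETCH_SECTORS_PER_PAGE (SKETCH_PAGE_SIZE >> SKETCH_SECTOR_SHIFT) /* 8 */

/* Hypothetical stand-in for one segment of a block ring request. */
struct sketch_segment {
    int first_sect; /* index of first 512-byte sector within the page */
    int last_sect;  /* index of last 512-byte sector within the page  */
};

/* Number of sectors covered by the segment (inclusive range). */
int sketch_nsec(const struct sketch_segment *seg)
{
    return seg->last_sect - seg->first_sect + 1;
}

/* Mirrors the two conditions from the comment: reject the segment if
 * its last sector falls outside the page, or if it covers no sectors. */
bool sketch_segment_ok(const struct sketch_segment *seg)
{
    if (seg->last_sect >= SKETCH_SECTORS_PER_PAGE)
        return false;
    if (sketch_nsec(seg) <= 0)
        return false;
    return true;
}

/* The logged case: last_sect 8, nsec 2, hence first_sect 7 (derived). */
void sketch_reported_case(void)
{
    struct sketch_segment seg = { .first_sect = 7, .last_sect = 8 };

    assert(sketch_nsec(&seg) == 2);    /* nsec is positive...          */
    assert(!sketch_segment_ok(&seg));  /* ...but last_sect is past 7   */
}
```

With these assumptions the only condition that can fire for the logged values is the last_sect one, which matches the observation that nsec was not <= 0.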
It seems to me that somehow the domU is asking for more 512-byte sectors than will fit in a page, so the dom0 has to fail the request. What I don't quite understand yet is why this is only seen in the installer and not during other heavy I/O. I'll do more investigation Monday.

Chris Lalancette

Okay, a bit of an update.

Installs work fine using ext2 for / ... ext3 is what triggers the problem.

It looks like blkfront/blkback is barfing on non-sector-aligned buffers which jbd is passing down to it.

In fs/jbd/transaction.c:do_get_write_access() the buffers in question are being allocated here:

frozen_buffer = jbd_kmalloc(jh2bh(jh)->b_size, GFP_NOFS);

These are all jbd metadata buffers (jh->b_jlist == BJ_Metadata).

drivers/xen/vbd.c has:

/* Each segment in a request is up to an aligned page in size. */
blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
blk_queue_max_segment_size(rq, PAGE_SIZE);

So, we think the generic block layer should be fixing up these buffers somewhere.

We can't see anything obvious in any of these areas that has changed recently.

Turns out that jbd is relying on kmalloc(1024) to return 1024-byte aligned memory (or at least memory that's 1024 bytes away from a page boundary) which is false when slab debugging is enabled.

Should be fixed with kernel-2.6.17-1.2488.fc6.

I've logged #200873 to track the real fix needed so we can switch CONFIG_DEBUG_SLAB back on for Xen.
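
To make the alignment issue concrete, here is a small C sketch of how a 1024-byte buffer that sits too close to the end of a page could produce the values seen in the printk. It is an illustration under assumptions, not the blkfront code: 4 KiB pages and 512-byte sectors are assumed, the helper names (`demo_first_sect`, `demo_last_sect`, `demo_crosses_page`) are hypothetical, the way sector indices are derived from offset and length is my assumption about the front end, and the page offset 3584 is just one value consistent with "last_sect 8, nsec 2".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumptions for illustration: 4 KiB pages, 512-byte sectors. */
#define DEMO_PAGE_SIZE   4096u
#define DEMO_SECTOR_SIZE 512u

/* True if a buffer of len bytes starting page_offset bytes into a page
 * runs past the end of that page. */
bool demo_crosses_page(size_t page_offset, size_t len)
{
    assert(len > 0 && page_offset < DEMO_PAGE_SIZE);
    return page_offset + len > DEMO_PAGE_SIZE;
}

/* Index of the 512-byte sector within the page where the buffer starts. */
unsigned demo_first_sect(size_t page_offset)
{
    return (unsigned)(page_offset / DEMO_SECTOR_SIZE);
}

/* Index of the last sector, computed from the start sector and length
 * (assumed derivation; len is a whole number of sectors). */
unsigned demo_last_sect(size_t page_offset, size_t len)
{
    assert(len >= DEMO_SECTOR_SIZE && len % DEMO_SECTOR_SIZE == 0);
    return demo_first_sect(page_offset)
           + (unsigned)(len / DEMO_SECTOR_SIZE) - 1;
}
```

Under these assumptions, a 1024-byte buffer at page offset 3072 (1024 bytes from the page boundary) gives first_sect 6 and last_sect 7 and stays inside the page, while the same buffer at offset 3584 gives first_sect 7 and last_sect 8 and crosses the page boundary, which is the case blkback rejects.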