Bug 200127

Summary:	Guests unable to install successfully
Product:	[Fedora] Fedora	Reporter:	Jeremy Katz <katzj>
Component:	xen	Assignee:	Xen Maintainance List <xen-maint>
Status:	CLOSED CURRENTRELEASE	QA Contact:	Brian Brock <bbrock>
Severity:	medium	Docs Contact:
Priority:	medium
Version:	rawhide	CC:	bbrock, bstein, clalance, herbert.xu, katzj, markmc, sct
Target Milestone:	---
Target Release:	---
Hardware:	All
OS:	Linux
Whiteboard:
Fixed In Version:	kernel-2.6.17-1.2488.fc6	Doc Type:	Bug Fix
Doc Text:		Story Points:	---
Clone Of:		Environment:
Last Closed:	2007-03-09 14:18:23 UTC	Type:	---
Bug Depends On:
Bug Blocks:	200124

Description Jeremy Katz 2006-07-25 16:34:31 UTC

If you have an earlier HV/dom0 installed (2.6.16-1.3001_FC5xen0 on x86_64 in
this case... it's been seen with x86 PAE as well) and start a guest install of
current rawhide, the install later aborts due to I/O errors.  dmesg from within
the guest:
<4>end_request: I/O error, dev xvda, sector 213205
<4>end_request: I/O error, dev xvda, sector 213473
<4>end_request: I/O error, dev xvda, sector 213631
<4>end_request: I/O error, dev xvda, sector 213675
<4>end_request: I/O error, dev xvda, sector 213701
<3>Aborting journal on device dm-0.
<2>ext3_abort called.
<2>EXT3-fs error (device dm-0): ext3_journal_start_sb: Detected aborted journal
<2>Remounting filesystem read-only
<4>__journal_remove_journal_head: freeing b_committed_data


This is with file backed IO -- I'll try to get a block device attached that I
can use instead to try and see if that works or fails similarly just to help
narrow things down.

Comment 1 Jeremy Katz 2006-07-25 17:26:58 UTC

The same error persists with a block-device-backed VBD.

Comment 2 Jeremy Katz 2006-07-27 19:49:29 UTC

This persists when doing a guest install with the 2462 kernel as both dom0 and domU.

Comment 3 Stephen Tweedie 2006-07-27 21:38:03 UTC

Reproduced.  I've also had a guest that was running an ancient rawhide with 2462
kernels in dom0 and domU, updating all the way to current rawhide --- 800
packages or so to update.  It did so fine without any IO errors.

So disk IO on this kernel is not broken per se; it's just inside anaconda that
it's failing.  (Same domU/dom0 kernel version in each case; lvm-backed domain in
each case, too.)

I have taken a full dmesg log from the fault in my own case; there is nothing to
indicate any problem except for the EIO itself:

<4>end_request: I/O error, dev xvda, sector 211149

Errors like the one above are the only sign of any problems, all on the /boot
LVM partition (which also matches what Jeremy reported).  All other errors seen
are expected consequences of that initial error.  Nothing shows up in any dom0
log files.

Comment 4 Stephen Tweedie 2006-07-27 21:44:08 UTC

Correction, error shows up in the _root_ filesystem (dm-0), not boot.

Comment 5 Jeremy Katz 2006-07-30 03:36:48 UTC

*** Bug 200648 has been marked as a duplicate of this bug. ***

Comment 6 Chris Lalancette 2006-07-30 13:52:13 UTC

So, I did some looking at this.  Since I was unable to reproduce at will, Jeremy
gave me access to a box that did it all of the time (running the latest CVS as
of Friday).  To try to debug, I did some instrumentation on the dom0 kernel. 
Basically, I just put a bunch of printk's in the error paths for the blkback
side of things.  When the domU install fails, here's what I saw out of my printks:

Invalid number of sectors: last_sect 8, nsec 2

I put this printk in drivers/xen/blkback/blkback.c, on line 381.  The error path
there is taken when the last sector in the request is >= PAGE_SIZE >> 9 (meaning
8), or when the number of sectors is <= 0 (which is not the case here, given the
printout; it is the last_sect check that fires).  Because we go into this error
path, blkback fails the I/O (meaning it returns BLKIF_RESP_ERROR on the ring
buffer), and the domU then fails the I/O, leading to the message we see in the
install.
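
For readers following along, the rejection condition described above can be
sketched as a small standalone C program fragment.  This is not the blkback
source: struct segment, segment_nsec() and segment_is_valid() are hypothetical
names, 4 KiB pages are assumed, and nsec is assumed to be computed as
last_sect - first_sect + 1 (which would make first_sect 7 for the printk above).

```c
#include <assert.h>
#include <stdbool.h>

#define PAGE_SIZE        4096u                       /* assumed 4 KiB pages */
#define SECTOR_SHIFT     9                           /* 512-byte sectors */
#define SECTORS_PER_PAGE (PAGE_SIZE >> SECTOR_SHIFT) /* 8 */

/* Hypothetical stand-in for one segment of a block request. */
struct segment {
    unsigned int first_sect; /* first 512-byte sector within the page */
    unsigned int last_sect;  /* last sector within the page, inclusive */
};

/* Number of sectors covered; can be <= 0 for a malformed segment. */
int segment_nsec(const struct segment *seg)
{
    assert(seg);
    return (int)seg->last_sect - (int)seg->first_sect + 1;
}

/* Mirrors the two conditions described in the comment above:
 * reject if last_sect >= PAGE_SIZE >> 9, or if nsec <= 0. */
bool segment_is_valid(const struct segment *seg)
{
    if (seg->last_sect >= SECTORS_PER_PAGE)
        return false;
    if (segment_nsec(seg) <= 0)
        return false;
    return true;
}
```

Under these assumptions a segment with first_sect 7 and last_sect 8 has nsec 2
and is rejected by the first condition, while first_sect 6 and last_sect 7 is
accepted.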

It seems to me that somehow the domU is asking for more 512-byte sectors than
will fit in a page, so the dom0 has to fail the request.  What I don't quite
understand yet is why this is only seen in the installer and not during other
heavy I/O.  I'll do more investigation Monday.

Chris Lalancette

Comment 7 Mark McLoughlin 2006-07-31 22:44:46 UTC

Okay, a bit of an update:

Installs work fine using ext2 for / ... ext3 is what triggers the problem.

It looks like blkfront/blkback is barfing on non-sector-aligned buffers which
jbd is passing down to it. In fs/jbd/transaction.c:do_get_write_access() the
buffers in question are being allocated here:

    frozen_buffer = jbd_kmalloc(jh2bh(jh)->b_size,
                                 GFP_NOFS);

These are all jbd metadata buffers (jh->b_jlist == BJ_Metadata).

drivers/xen/vbd.c has:

        /* Each segment in a request is up to an aligned page in size. */
        blk_queue_segment_boundary(rq, PAGE_SIZE - 1);
        blk_queue_max_segment_size(rq, PAGE_SIZE);

So, we think the generic block layer should be fixing up these buffers somewhere.

We can't see anything obvious in any of these areas that has changed recently.
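
As an illustration of what the two queue limits quoted from vbd.c mean for a
single buffer, here is a minimal sketch.  The block layer's actual check is not
quoted in this bug, so segment_within_limits() is a hypothetical helper, and
4 KiB pages are assumed.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096u /* assumed 4 KiB pages */

/* Returns 1 if the byte range [start, start + len) fits in one segment:
 * it must be no longer than max_size, and its first and last bytes must
 * fall inside the same boundary_mask-aligned window. */
int segment_within_limits(uintptr_t start, uintptr_t len,
                          uintptr_t boundary_mask, uintptr_t max_size)
{
    uintptr_t last;

    assert(len > 0);
    last = start + len - 1;

    if (len > max_size)
        return 0;
    /* OR-ing with the mask maps every address to the top of its window. */
    return (start | boundary_mask) == (last | boundary_mask);
}
```

With boundary_mask = PAGE_SIZE - 1 and max_size = PAGE_SIZE, a 1024-byte buffer
at address 0x1C00 ends exactly at the page boundary and passes, while the same
buffer at the made-up address 0x1E10 straddles two pages and does not.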

Comment 8 Herbert Xu 2006-08-01 00:51:37 UTC

It turns out that jbd is relying on kmalloc(1024) to return 1024-byte-aligned
memory (or at least memory that's 1024 bytes away from a page boundary), an
assumption that does not hold when slab debugging is enabled.
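
To make the connection to the earlier "last_sect 8, nsec 2" printk concrete,
here is a rough sketch of how a shifted 1024-byte buffer could produce those
numbers.  It assumes the frontend derives sector indices by shifting the
buffer's byte offset within the page and its length right by 9; the helper
names and the 16-byte shift in the usage note are made up for illustration,
since the real offsets under slab debugging are not given in this bug.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE    4096u /* assumed 4 KiB pages */
#define SECTOR_SHIFT 9     /* 512-byte sectors */

struct sect_range {
    unsigned int first_sect; /* index of first sector within the page */
    unsigned int last_sect;  /* index of last sector, inclusive */
};

/* Assumed frontend-style calculation: sector indices come straight from
 * the byte offset and length, with no check for sector alignment. */
struct sect_range sectors_for_buffer(size_t page_offset, size_t len)
{
    struct sect_range r;

    assert(page_offset < PAGE_SIZE);
    assert(len >= ((size_t)1 << SECTOR_SHIFT));

    r.first_sect = (unsigned int)(page_offset >> SECTOR_SHIFT);
    r.last_sect = r.first_sect + (unsigned int)(len >> SECTOR_SHIFT) - 1;
    return r;
}

/* Returns 1 if [page_offset, page_offset + len) spills into the next page. */
int buffer_crosses_page(size_t page_offset, size_t len)
{
    return page_offset + len > PAGE_SIZE;
}
```

A 1024-byte buffer at page offset 3072 maps to sectors 6..7 and stays inside
the page.  Shift it to offset 3600 (3584 plus a hypothetical 16 bytes) and it
maps to sectors 7..8 and crosses the page boundary, which is the shape of
request that the backend rejected.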

Comment 9 Mark McLoughlin 2006-08-01 06:59:51 UTC

Should be fixed with kernel-2.6.17-1.2488.fc6.

I've logged #200873 to track the real fix needed so we can switch
CONFIG_DEBUG_SLAB back on for Xen.