Bug 223947
Summary: raid10_make_request bug: can't convert block across chunks or bigger than 64k..

Product: Red Hat Enterprise Linux 5
Component: kernel-xen
Version: 5.0
Hardware: All
OS: Linux
Status: CLOSED ERRATA
Severity: high
Priority: high
Reporter: Daniel Riek <riek>
Assignee: Mikuláš Patočka <mpatocka>
CC: agk, alex, ask, cevich, coughlan, cpaul, cward, dennisml, dledford, dzickus, ehabkost, info, jturner, kernel-mgr, mbroz, mpatocka, mpoole, scott_purcell, sct, sean, sputhenp, syeghiay, tao, tis, xen-maint
Doc Type: Bug Fix
Last Closed: 2009-09-02 08:31:47 UTC
Bug Blocks: 492568
Description
Daniel Riek
2007-01-23 08:03:36 UTC
This request was evaluated by Red Hat Product Management for inclusion in a Red Hat Enterprise Linux major release. Product Management has requested further review of this request by Red Hat Engineering for potential inclusion in a Red Hat Enterprise Linux major release. This request is not yet committed for inclusion.

The problem also does not exist when tap:aio is used instead of phy to access the lvm2 volume on the backend; virt-install, however, does not currently support installation in that mode. At this point we are not sure whether a non-intrusive fix is possible. The problem might be in one or more of several layers, as ext3, block-frontend, block-backend, lvm, and md are layered on top of each other. The tap:aio workaround has not seen much testing for lvm storage (it is generally only used for files) and is not suited as a default setting. So my recommendation is:

- Have QE verify whether the problem only exists for lvm on top of md raid10 (vs. raid5, etc.).
- Release-note the issue in 5.0 as known not to work.
- Fix in 5.1.

Moving to 5.1 and cloning a release-note bug for 5.0.

Bug #224077 tracks the release note for 5.0.

The only possibly-related patch outstanding in the dm area of the code I can think of is this one, to preserve max_hw_sectors: http://www.kernel.org/pub/linux/kernel/people/agk/patches/2.6/editing/dm-merge-max_hw_sector.patch (volatile URL)

Bug #245681 reports similar messages and is probably the same bug, but on Fedora 7.

This request was previously evaluated by Red Hat Product Management for inclusion in the current Red Hat Enterprise Linux release, but Red Hat was unable to resolve it in time. This request will be reviewed for a future Red Hat Enterprise Linux release.

I still have not had a chance to re-create the test environment (at home). What about the customer who reported the same issue?

This request was evaluated by Red Hat Product Management for inclusion, but this component is not scheduled to be updated in the current Red Hat Enterprise Linux release. If you would like this request to be reviewed for the next minor release, ask your support representative to set the next rhel-x.y flag to "?".

Is this still a problem for the customer who originally reported the issue?

Chris Lalancette: I haven't tried for a few months, but last I tried it was still happening. FWIW, someone in the comments here reports that it's still a problem - I am guessing in $recent_version: http://strugglers.net/~andy/blog/2008/01/20/red-hat-based-linux-under-xen-from-debian-etch/

Running across this problem on a RHEL 5.2 x86_64 host with RHEL 5.2 x86_64 guests, though it doesn't appear to be fatal, as the installation continues. The disks are 4x 160 GB SATA drives in an md RAID10 with 1024K chunks, LVM2 on top of that with 32 MB PEs, and 1 LV per guest for its storage. Naturally my error is:

raid10_make_request bug: can't convert block across chunks or bigger than 64k..
```
# cat /proc/mdstat
Personalities : [raid10] [raid1]
md0 : active raid1 sda1[0] sdb1[1] sdc1[2] sdd1[3]
      256896 blocks [4/4] [UUUU]

md1 : active raid10 sda3[0] sdd3[3] sdc3[2] sdb3[1]
      311547904 blocks 1024K chunks 2 near-copies [4/4] [UUUU]

unused devices: <none>

# mdadm --detail /dev/md1
/dev/md1:
        Version : 00.90.03
  Creation Time : Mon Jun 16 14:21:57 2008
     Raid Level : raid10
     Array Size : 311547904 (297.12 GiB 319.03 GB)
  Used Dev Size : 155773952 (148.56 GiB 159.51 GB)
   Raid Devices : 4
  Total Devices : 4
Preferred Minor : 1
    Persistence : Superblock is persistent

    Update Time : Thu Jul 31 13:00:53 2008
          State : active
 Active Devices : 4
Working Devices : 4
 Failed Devices : 0
  Spare Devices : 0

         Layout : near=2, far=1
     Chunk Size : 1024K

           UUID : 0195a9b8:f2a3ec17:10f8a276:34054b89
         Events : 0.42399

    Number   Major   Minor   RaidDevice State
       0       8        3        0      active sync   /dev/sda3
       1       8       19        1      active sync   /dev/sdb3
       2       8       35        2      active sync   /dev/sdc3
       3       8       51        3      active sync   /dev/sdd3
```

(In reply to comment #16)

Errr, scratch that - I copy-pasted the wrong error. My error is:

kernel: raid10_make_request bug: can't convert block across chunks or bigger than 1024k 504123387 4

Created attachment 314441 [details]
anaconda log
I just tried again, and at least with my setup I get a fatal anaconda error after the raid10_make_request errors.
I'm attaching my anaconda log.
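
A note on what the message itself is saying, based on my reading of the 2.6.18-era md raid0/raid10 drivers, so treat the details as an assumption: md cannot split a request at this point, so it rejects any single request that is bigger than the chunk size or that straddles a chunk boundary, and the two trailing numbers appear to be the starting sector and the request size in KiB. Checking that against the "1024k 504123387 4" line above:

```sh
# Assumption: the trailing figures in the message are the bio's starting
# sector and its size in KiB. Checking the "1024k 504123387 4" line:
CHUNK_SECTORS=$((1024 * 2))        # 1024 KiB chunk = 2048 sectors of 512 bytes
SECTOR=504123387                   # starting sector from the log line
SIZE_SECTORS=$((4 * 2))            # 4 KiB request = 8 sectors
OFFSET=$((SECTOR % CHUNK_SECTORS)) # where the request starts inside its chunk
echo "starts at sector $OFFSET of $CHUNK_SECTORS, ends at $((OFFSET + SIZE_SECTORS))"
# -> starts at sector 2043 of 2048, ends at 2051: the request spills into the
#    next chunk, which the driver cannot split here, so it rejects the bio.
```

That would also explain why the errors are intermittent: only requests that happen to land near a chunk boundary trip the check.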
See also bug #224077.

We are experiencing this bug with RHEL 5.2:

```
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858107 3
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858107 3
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 109050879 4
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 109050879 4
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858111 4
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858111 4
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858107 3
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858235 4
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858107 3
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858235 4
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858107 3
kernel: raid0_make_request bug: can't convert block across chunks or bigger than 64k 104858235 4
```

Getting kind of desperate for some idea of what this actually means; we are paying to have the machine in colo but are unsure whether this is destructive or not, so we aren't game to deploy much to it.

We seem to have resolved this by emptying and deleting the LVs, removing the underlying RAID0 device from the VG, splitting the RAID10 device back into two RAID1 devices, and then adding those as PVs to the VG (a rough sketch of these steps follows below). From our experience, this bug is caused by some interaction between LVM and a RAID10 device.

*** Bug 440093 has been marked as a duplicate of this bug. ***

This huge patchset seems to be related to this issue: http://thread.gmane.org/gmane.linux.kernel/562845

I ran into this problem too. From what I gather, this makes software RAID10 unusable in a lot of deployments. I was planning to use a RAID10 setup for virtualisation to improve shared I/O performance over a regular RAID1 setup for the domUs, but with this bug that seems to be impossible right now. Is there a known workaround?

I can confirm this on RHEL 5.3 (latest packages), 64-bit Xen kernel, on a Sun x4540 ("Thumper"). Obviously, running 24 separate instances of RAID1 is a bit tedious. I saw both the errors at boot ("Can't write to /boot") and the problem with pygrub. I have seen this on two separate machines.

(In reply to comment #28)
> with this bug that seems to be impossible right now. Is there a known
> workaround?

What I do is just use "file based" disks that are stored on a raid10-backed partition. In my tests I lost a little bit of performance, but not too much (I find the management of the files much more tedious than doing it through LVM).
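To make the RAID10-to-two-RAID1 workaround above concrete, here is a rough, untested sketch of the kind of commands involved; the array, partition, and VG names are invented for illustration, and everything on the LVs must be backed up first, since they are destroyed along the way:

```sh
# Hypothetical sketch of the RAID10 -> two-RAID1 migration described above.
# /dev/md1 (RAID10 over sd[abcd]3) and VG "vg0" are example names only.
lvremove /dev/vg0/guest1                  # after backing up: empty and delete the LVs
vgreduce vg0 /dev/md1                     # drop the RAID10 PV out of the VG
mdadm --stop /dev/md1                     # tear the RAID10 array down
mdadm --zero-superblock /dev/sd[abcd]3    # forget the old array metadata
mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sda3 /dev/sdb3
mdadm --create /dev/md3 --level=1 --raid-devices=2 /dev/sdc3 /dev/sdd3
pvcreate /dev/md2 /dev/md3                # the two RAID1 arrays become the new PVs
vgextend vg0 /dev/md2 /dev/md3            # rejoin the VG, then recreate the LVs
```

A plain RAID1 array has no chunking, so no request can ever straddle a chunk boundary, which is presumably why this sidesteps the bug.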
Created attachment 342638 [details]
A proposed patch and an explanation of the bug
I've got an idea of how to fix it; the explanation of the bug is in the patch header. Please test this patch. First confirm that the bug happens and that you can reproduce it reliably, then apply the patch and try the installation inside Xen again. The patch needs to be applied only to the dom0 kernel, not to the guest kernel.

Does anyone reading this think they'll have a chance in the next day or two to test that this patch does in fact fix the problem? (We're right up against a deadline here, so even without confirmation at this stage, I'd recommend we switch the bz over to the RHEL kernel and include this patch.)

I'm about to test this in the next few days. I have disks for the test system ordered and should get them tomorrow. A test kernel based on 2.6.18-128.1.10.el5 with the proposed patch has already been compiled, and I have an anaconda with md raid10 support to install the test system.

Given the time pressure, I installed Xen and tested it on my own. I managed to reproduce the bug and get I/O errors on filesystem creation. When I applied this patch to the dom0 kernel and used the same configuration to create a virtual machine, it no longer failed and proceeded with the installation. It is still installing; I'll test the installed system tomorrow.

I haven't seen any I/O error so far after the patch was applied. I'm going to submit it.

Tested the patch and it seems to fix the problem. No more I/O errors, and the virtual machine works.

(In reply to comment #36)
> I haven't seen any I/O error so far after the patch was applied. I'm going to
> submit it.

Is this going to appear in a future 5.3 kernel, or will this only be available in 5.4/6.0?

In kernel-2.6.18-152.el5. You can download this test kernel from http://people.redhat.com/dzickus/el5

Please do NOT transition this bugzilla state to VERIFIED until our QE team has sent specific instructions indicating when to do so. However, feel free to provide a comment indicating that this fix has been verified.

~~ Attention - RHEL 5.4 Beta Released! ~~

RHEL 5.4 Beta has been released! There should be a fix present in the Beta release that addresses this particular request. Please test and report back results here, at your earliest convenience. The RHEL 5.4 General Availability release is just around the corner!

If you encounter any issues while testing Beta, please describe the issues you have encountered and set the bug into NEED_INFO. If you encounter new issues, please clone this bug to open a new issue and request it be reviewed for inclusion in RHEL 5.4 or a later update, if it is not of urgent severity.

Please do not flip the bug status to VERIFIED. Only post your verification results and, if available, update the Verified field with the appropriate value. Questions can be posted to this bug or your customer or partner representative.

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.

http://rhn.redhat.com/errata/RHSA-2009-1243.html

Is there a specific reason why this patch is not committed upstream?

This patch is upstream, although in a different form (the code has changed since RHEL 5).
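
For anyone who wants to verify the fix on their own machine, the test procedure used in the comments above boils down to roughly the following; the LV, VG, and guest names are examples, and the phy-vs-tap:aio detail comes from the original description:

```sh
# Rough reproduction recipe distilled from this thread (names are examples).
# dom0: carve a guest disk out of a VG whose only PV is an md RAID10 array.
lvcreate -L 10G -n guesttest vg0          # vg0 sits on e.g. /dev/md1 (RAID10)
# Export it to the guest via the phy: backend; tap:aio is reported to mask
# the bug, so the guest config line should look like:
#   disk = [ 'phy:/dev/vg0/guesttest,xvda,w' ]
# Start the install (or just mkfs from inside the guest), then watch dom0:
dmesg | grep -E 'raid(0|10)_make_request' # errors without the patch, none with it
```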