Bug 440506
Summary: panic in aoe:aoecmd_ata_rsp during direct I/O to lvm [snap,mirror,stripe]
Product: Red Hat Enterprise Linux 5
Reporter: Corey Marthaler <cmarthal>
Component: kernel
Assignee: Tom Coughlan <coughlan>
Status: CLOSED ERRATA
QA Contact: Martin Jenner <mjenner>
Severity: urgent
Priority: high
Version: 5.2
CC: coughlan, ed.cashin, syeghiay
Target Milestone: beta
Keywords: TestBlocker
Hardware: All
OS: Linux
Doc Type: Bug Fix
Last Closed: 2009-01-20 19:48:45 UTC
Bug Blocks: 448732
Description
Corey Marthaler
2008-04-03 20:52:31 UTC
This is reproducible. I hit this while running single machine lvm mirror block level I/O. This smells like a regression and potentially a pretty big issue if we support aoe in rhel5.2.

This bugzilla has Keywords: Regression. Since no regressions are allowed between releases, it is also being proposed as a blocker for this release. Please resolve ASAP.

I've tried this with lvm stripes as well and that also causes the panic. So this is somehow related to multiple device aoe lvm volumes.

I/O to snap, mirrors, and stripes causes the panics, but linears are fine. Here is the I/O I was running:
b_iogen -o -m random -f direct -i 500 -s write,writev -t1000b -T10000b -d /dev/hayes/lvol0 | b_doio -vD
I've also attempted this with a stripe, mirror, and a snapshot on the system, but as long as I only wrote to the linear, I was fine. As soon as I wrote to the snap/mirror/stripe, panic.

Just a note that this is still occurring on 2.6.18-92.el5.

[This bugzilla post follows a similar direct email.]
I would like to help resolve the issue that you are seeing when using AoE and LVM striping/snapshotting together. Can you please check whether the same problem is present with the aoe6-62 driver from the Coraid website?
http://www.coraid.com/support/linux/
I understand that RHEL might prefer to use the aoe driver in the 2.6.18 kernel, but knowing whether the current aoe driver exhibits the same behavior will help me to identify any bug.
Could you please provide me with commands that I can run to reproduce the panics you are seeing? I see one listed above, but it looks like recently you were able to cause the problem to manifest more easily and consistently.
I would like to know something about the kind of AoE target you have. Tom Coughlan mentions that it is a Coraid AoE storage box. Could you please provide the "sos" output from that box? You can email it to me, since it is more than a screenful. If you have questions about using CEC or the serial console, you can email support.
Information from the AoE initiator would complete the picture. Can you please send the file resulting from a run of the "sos-linux" script? It is available at the following URL.
http://www.coraid.com/support/linux/sos-linux
I can try to replicate the problem in our lab here, but if you are willing to test out patches on your system, I could send them to you in order first to diagnose and then to fix the problem.
I appreciate the work you have done in characterizing this problem. I am also glad that Tom Coughlan brought this issue to my attention. Thank you both.

I have verified that this issue is fixed with the latest aoe driver (v6.2).

OK. If the aoe6-62 driver doesn't have this panic, will RHEL use the aoe6-62 driver, or should I attempt to find and backport the fix?
We do regularly push updates upstream to kernel.org, but I am running behind on the latest push. In other words, I know that if RHEL puts aoe6-62 in RHEL now, the upstream will catch up, but I cannot say when.

That is a good question. Tom, how do we get the aoe6-62 driver into RHEL5 asap?

(In reply to comment #11)
> OK. If the aoe6-62 driver doesn't have this panic, will
> RHEL use the aoe6-62 driver, or should I attempt to find
> and backport the fix?
The highest priority at this stage in RHEL 5 is to avoid regressions. So, ideally, you would find and backport the specific fix. We have some latitude here, though. If you can make a convincing case that the risk of a larger update is low and the benefit is large, we can look at it. I would not be in favor of shipping a version of the driver in RHEL before it has gotten some significant review and testing upstream, and in Fedora.
Please take a look at the diff between 5.2 and recent driver versions. If you can isolate the fix, that would be great.
If not, suggest a driver version that has had some upstream exposure, and the smallest amount of change that is likely to have the fix. Then maybe Corey can test that and see if it has the fix.

The diff between the aoe driver in 2.6.18, which is aoe6-22, and the one that is aoe6-62, is huge. Besides bug fixes, there have been many new features added.
To identify and backport the fix, I would need to be able to replicate the problem or to work very closely with Corey Marthaler.
For replicating the problem, I just need to know the software versions involved and the simplest commands that trigger a panic.
For working with C.M.,
* I would provide patches to C.M.'s kernel sources,
* C.M. would apply the patches and build a modified aoe driver,
* C.M. would install the modified aoe driver, and run the commands,
* C.M. would send me kernel messages, e.g., from netconsole,
* I would evaluate the gathered information,
... and then we'd repeat with the next round of patches. If this loop can iterate quickly, it should not take very long to identify and backport the fix.

The commands that I ran for this can be boiled down pretty easily.
1. Create one of the following with your aoe devices (an lvm snapshot/mirror/stripe). For a snapshot:
# pvcreate /dev/etherd/e1p[123]
# vgcreate vg /dev/etherd/e1p[123]
# lvcreate -L 4G -n origin vg
# lvcreate -s vg/origin -L 1G -n snap
2. Run some kind of block level I/O to that snap volume (/dev/vg/snap). I used our tool b_iogen/b_doio, but I assume a dd would work as well.
3. That's it, you should have triggered that panic.

Can you please confirm that the command below can trigger a panic?
dd if=/dev/vg/snap of=/dev/null bs=1M count=1000
Also, could you please email me the file that results when you run this sos-linux script,
http://www.coraid.com/support/linux/sos-linux
? I would like to have more specific information about your system in case I have trouble reproducing your problem.
My initial attempts in a VMware instance running a RHEL clone are not causing a panic. I'm working on getting RHEL set up for testing.

Also, does the panic only occur when you have created partitions on the aoe device(s)? What kind of partition table are you using, fdisk or GPT?

We've narrowed this down to direct I/O. I can't reproduce this using dd (even with the iflag=direct). However, here is a brain dead program that only does a direct read. It causes the panic every time. Also, our aoe device had been partitioned using gpt labels.

Created attachment 310391 [details]
program to repo this panic on v22
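The attached program itself is not reproduced in this report. As a rough illustration only, a direct read of the kind described above might look like the following minimal sketch, assuming Linux with glibc. The function name direct_read, its parameters, and the alignment handling are hypothetical choices made for this example and are not taken from the attachment.

```c
#define _GNU_SOURCE          /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper (not the attached program): issue one read of `len`
 * bytes from the start of `path` with O_DIRECT, bypassing the page cache.
 * O_DIRECT generally requires the buffer address and the length to be
 * aligned (typically to the device's logical block size), so the buffer
 * comes from posix_memalign() with the caller-supplied alignment.
 * Returns the number of bytes read, or -1 on any error.
 */
ssize_t direct_read(const char *path, size_t len, size_t align)
{
    void *buf = NULL;
    ssize_t n;
    int fd;

    assert(path != NULL);

    /* Fails if align is not a power of two multiple of sizeof(void *). */
    if (posix_memalign(&buf, align, len) != 0)
        return -1;

    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) {
        free(buf);
        return -1;
    }

    n = read(fd, buf, len);   /* a single direct read, nothing else */

    close(fd);
    free(buf);
    return n;
}
```

Under these assumptions, a call such as direct_read("/dev/vg/snap", 1 << 20, 4096) would issue a single 1 MiB direct read against the snapshot volume from the steps above. On an affected kernel that may panic the machine, so it should only be tried on a test system.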
Created attachment 310392 [details]
Here is the output file you requested
Created attachment 311070 [details]
aoe: use bio->bi_idx to access biovecs
The attached patch causes the aoe driver to use the bio's bi_idx field when accessing the biovecs. The test case from Corey Marthaler panics consistently without this patch, but the change in the patch eliminates the panic.
This patch was created using the standalone aoe driver (also version aoe6-22) from the Coraid website. To use it with the standalone driver requires that the second argument to skb_linearize be removed as it is in the RHEL 5.2 kernel sources. With a "-p2" level, the patch is expected to apply cleanly to the RHEL kernel sources.
Just in case the Mac I'm using does something strange to the patch, I've made it available here, as well:
http://noserose.net/e/temp/aoe6-22-22i.diff

I should have asked: Please try out the patch, "aoe: use bio->bi_idx to access biovecs", and let me know how it works for you as soon as you can. I understand there's a RHEL deadline coming up at the end of this month, when I expect to be quite busy.

(In reply to comment #23)
> I understand there's a RHEL deadline coming up
> at the end of this month, when I expect to be quite
> busy.
Thanks for isolating the patch. The RHEL 5.2 deadline was quite a while ago, and we are just beginning development on 5.3, so we have a while. The BZ was marked urgent because it was thought to be a regression in 5.2. I'm not sure that is true, since the driver did not change in 5.2.
I'll request Corey test this by setting NEEDINFO. (The BZ should not be in the VERIFIED state, anyway, because the patch is not in RHEL 5 yet.)
I'll also ask Chip to handle this from here. :)
Tom

Thank you. Yes, I thought it was odd that it was being called a regression, since there were no new changes. Please let me know if I can be of further assistance.

After once again reproducing this bz on 2.6.18-92.el5, I was unable to reproduce it on the newly built kernel with the fix in it (2.6.18-105.el5.bz440506).
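For readers unfamiliar with the fix, the following is a minimal user-space sketch of what "use bio->bi_idx to access biovecs" means. The struct and function names are simplified stand-ins invented for this illustration and are not the kernel's definitions. It assumes, as the patch description implies, that a bio can reach the driver with a nonzero bi_idx (a plausible reading is that stacked LVM targets hand down bios that share a biovec array), in which case starting at entry 0 reads the wrong biovec. It also shows that the two spellings of the fix that appear in this report select the same entry.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins; NOT the kernel's struct bio_vec / struct bio. */
struct sketch_bio_vec {
    unsigned int bv_len;     /* bytes covered by this entry */
    unsigned int bv_offset;  /* offset within its page */
};

struct sketch_bio {
    struct sketch_bio_vec *bi_io_vec; /* start of the biovec array */
    unsigned short bi_vcnt;           /* number of entries in the array */
    unsigned short bi_idx;            /* first entry this bio should use */
};

/* Behavior before the patch: always start at entry 0, ignoring bi_idx. */
struct sketch_bio_vec *first_vec_unpatched(struct sketch_bio *bio)
{
    return bio->bi_io_vec;
}

/* Idiom used in the attached patch: pointer arithmetic from the array base. */
struct sketch_bio_vec *first_vec_patch(struct sketch_bio *bio)
{
    assert(bio->bi_idx < bio->bi_vcnt);
    return bio->bi_io_vec + bio->bi_idx;
}

/* Idiom quoted as the upstream form: address of the indexed element. */
struct sketch_bio_vec *first_vec_upstream(struct sketch_bio *bio)
{
    assert(bio->bi_idx < bio->bi_vcnt);
    return &bio->bi_io_vec[bio->bi_idx];
}
```

With a four-entry array and bi_idx set to 2, both patched helpers return the address of entry 2, while the unpatched helper returns entry 0. By the C definition of array subscripting the two patched forms are equivalent, so the choice between them is purely stylistic.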
(In reply to comment #22)
> Created an attachment (id=311070) [details]
> aoe: use bio->bi_idx to access biovecs
Ed,
Why did you do it this way
- buf->bv = buf->bio->bi_io_vec;
+ buf->bv = buf->bio->bi_io_vec + buf->bio->bi_idx;
rather than the way it is done upstream
- buf->bv = buf->bio->bi_io_vec;
+ buf->bv = &bio->bi_io_vec[bio->bi_idx];
?
Tom

I think I just saw what needed to be done, did it, tested it, and only later noticed that I had used a different idiom in the past, but the two versions are identical.

in kernel-2.6.18-109.el5
You can download this test kernel from http://people.redhat.com/dzickus/el5

An advisory has been issued which should help the problem described in this bug report. This report is therefore being closed with a resolution of ERRATA. For more information on the solution and/or where to find the updated files, please follow the link below. You may reopen this bug report if the solution does not work for you.
http://rhn.redhat.com/errata/RHSA-2009-0225.html